[jira] [Commented] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
[ https://issues.apache.org/jira/browse/SPARK-20248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960371#comment-15960371 ] Apache Spark commented on SPARK-20248: -- User 'shaolinliu' has created a pull request for this issue: https://github.com/apache/spark/pull/17561

> Spark SQL add limit parameter to enhance the reliability.
> Key: SPARK-20248
> URL: https://issues.apache.org/jira/browse/SPARK-20248
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Environment: 2.1.0
> Reporter: shaolinliu
> Priority: Minor
>
> When we use the Thrift server, it is difficult to constrain users' SQL statements.
> When a user queries a large table without a LIMIT clause, the Thrift server process must hold the full result set in memory, which can destabilize the service.
> In general, such a query is a misuse, because if you really need the whole table:
> 1. if you use this data for computation, you can complete the computation in the cluster and return only the result;
> 2. if you want to obtain the raw data, you can store it in HDFS instead.
> For the scenario above, it is recommended to add a "spark.sql.thriftServer.retainedResults" parameter:
> 1. when it is 0, we do not restrict the user's operation;
> 2. when it is greater than 0, if the user queries with a LIMIT we use the user's limit; if not, we use this value to cap the query's result.
> The user's own limit takes priority because a user who writes a LIMIT is generally aware of the exact meaning of the query.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
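The rule the issue proposes can be sketched as a small helper. This is a hypothetical illustration of the semantics described above, not code from the pull request; the function and parameter names (`effectiveLimit`, `retainedResults`, `userLimit`) are made up:

```scala
// Sketch of the proposed spark.sql.thriftServer.retainedResults semantics:
// 0 disables the server-side cap; otherwise the user's own LIMIT, if any,
// takes priority over the configured cap.
def effectiveLimit(retainedResults: Long, userLimit: Option[Long]): Option[Long] =
  if (retainedResults == 0) userLimit            // no restriction from the server
  else userLimit.orElse(Some(retainedResults))   // fall back to the configured cap
```

For example, with `retainedResults = 1000`, a query carrying `LIMIT 50` keeps its own limit, while an unbounded query is capped at 1000 rows.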
[jira] [Commented] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
[ https://issues.apache.org/jira/browse/SPARK-20248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960335#comment-15960335 ] Apache Spark commented on SPARK-20248: -- User 'shaolinliu' has created a pull request for this issue: https://github.com/apache/spark/pull/17560
[jira] [Assigned] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
[ https://issues.apache.org/jira/browse/SPARK-20248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20248: Assignee: Apache Spark
[jira] [Assigned] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
[ https://issues.apache.org/jira/browse/SPARK-20248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20248: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20246: Assignee: Apache Spark

> Should check determinism when pushing predicates down through aggregation
> Key: SPARK-20246
> URL: https://issues.apache.org/jira/browse/SPARK-20246
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Weiluo Ren
> Assignee: Apache Spark
> Labels: correctness
>
> {code}
> import org.apache.spark.sql.functions._
> spark.range(1, 1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show
> {code}
> gives a wrong result.
> The optimized logical plan shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen.
> cc [~lian cheng]
[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960329#comment-15960329 ] Apache Spark commented on SPARK-20246: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/17559
[jira] [Assigned] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20246: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960325#comment-15960325 ] Liang-Chi Hsieh commented on SPARK-20246: - For union and window, we don't perform the expression replacement, so I think they should be safe.
[jira] [Created] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
shaolinliu created SPARK-20248: -- Summary: Spark SQL add limit parameter to enhance the reliability. Key: SPARK-20248 URL: https://issues.apache.org/jira/browse/SPARK-20248 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Environment: 2.1.0 Reporter: shaolinliu Priority: Minor
[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960300#comment-15960300 ] Liang-Chi Hsieh commented on SPARK-20246: - We should also check the determinism of the [replaced expression|https://github.com/apache/spark/blob/a4491626ed8169f0162a0dfb78736c9b9e7fb434/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L810].
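The fix discussed in these comments amounts to checking determinism on the predicate *after* the aggregate's aliases have been substituted in, not before. A toy sketch of that guard, using simplified stand-ins rather than the real Catalyst `Expression` API:

```scala
// Simplified stand-ins for Catalyst expressions (not the real API):
sealed trait Expr { def deterministic: Boolean }
case class Attr(name: String) extends Expr { def deterministic = true }
case object Rand extends Expr { def deterministic = false }
case class Gt(child: Expr, bound: Double) extends Expr {
  def deterministic: Boolean = child.deterministic
}

// Substitute aggregate-side aliases into the predicate, then check
// determinism of the *replaced* expression before pushing it down.
def canPushDown(pred: Expr, aliases: Map[String, Expr]): Boolean = {
  def substitute(e: Expr): Expr = e match {
    case Attr(n)  => aliases.getOrElse(n, e)
    case Gt(c, b) => Gt(substitute(c), b)
    case other    => other
  }
  substitute(pred).deterministic
}

// col("random") > 0.3, where "random" aliases rand(): after substitution
// the predicate is non-deterministic, so it must NOT be pushed down.
// canPushDown(Gt(Attr("random"), 0.3), Map("random" -> Rand)) == false
```

The point is that `Attr("random")` looks deterministic on its own; only after replacing the alias with `rand()` does the check correctly reject the pushdown.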
[jira] [Commented] (SPARK-20241) java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribut
[ https://issues.apache.org/jira/browse/SPARK-20241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960291#comment-15960291 ] 翟玉勇 commented on SPARK-20241: - I am sorry; this WARN is caused by code I added to Spark myself. I changed the thread classloader in HiveSessionState, but an exception was thrown before the classloader was reset back.

> java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribute
> Key: SPARK-20241
> URL: https://issues.apache.org/jira/browse/SPARK-20241
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: 翟玉勇
> Priority: Minor
>
> Run a SQL operation, for example: insert into table mytable select * from othertable. The following WARN is logged:
> {code}
> 17/04/06 17:04:55 WARN [org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator(87) -- main]: Error calculating stats of compiled class.
> java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribute
> at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:164)
> at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:168)
> at sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:55)
> at sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
> at java.lang.reflect.Field.get(Field.java:379)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:997)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:984)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:984)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:983)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:983)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:979)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:979)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:951)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1023)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1020)
> at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerato
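The failure mode the reporter describes (an exception thrown before the classloader was reset back) is the classic reason to restore a thread's context classloader in a `finally` block. A minimal sketch; the `withClassLoader` helper name is hypothetical, not a Spark API:

```scala
object ClassLoaderUtil {
  /** Runs `body` with `loader` as the thread context classloader,
    * restoring the original loader even if `body` throws. */
  def withClassLoader[T](loader: ClassLoader)(body: => T): T = {
    val thread = Thread.currentThread()
    val original = thread.getContextClassLoader
    thread.setContextClassLoader(loader)
    try body
    finally thread.setContextClassLoader(original) // always reset back
  }
}
```

With this pattern, the "Can not set final [B field" WARN described above would not appear, because the swapped-in classloader can never leak past the wrapped block.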
[jira] [Closed] (SPARK-20241) java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribute
[ https://issues.apache.org/jira/browse/SPARK-20241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 翟玉勇 closed SPARK-20241. --- Resolution: Not A Bug
[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960287#comment-15960287 ] Liang-Chi Hsieh commented on SPARK-20246: - It seems we already check determinism [here|https://github.com/apache/spark/blob/a4491626ed8169f0162a0dfb78736c9b9e7fb434/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L805].
[jira] [Comment Edited] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960001#comment-15960001 ] Liang-Chi Hsieh edited comment on SPARK-20226 at 4/7/17 5:19 AM: - {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. It must be set through SQLConf. I am afraid that the configs in local.conf only cover Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? was (Author: viirya): {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. It must be set through SQLConf. I am afraid that the configs in local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf?

> Call to sqlContext.cacheTable takes an incredibly long time in some cases
> Key: SPARK-20226
> URL: https://issues.apache.org/jira/browse/SPARK-20226
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Environment: linux or windows
> Reporter: Barry Becker
> Labels: cache
> Attachments: profile_indexer2.PNG, xyzzy.csv
>
> I have a case where the call to sqlContext.cacheTable can take an arbitrarily long time depending on the number of columns referenced in a withColumn expression applied to a dataframe.
> The dataset is small (20 columns, 7861 rows). The sequence to reproduce is the following:
> 1) Add a new column that references 8 - 14 of the columns in the dataset.
>    - If I add 8 columns, the call to cacheTable is fast - like *5 seconds*.
>    - If I add 11 columns, it is slow - like *60 seconds*.
>    - If I add 14 columns, it basically *takes forever* - I gave up after 10 minutes or so.
> The added Column expression basically just concatenates the columns into a single string. If a number is concatenated onto a string (or vice versa), the number is first converted to a string.
> The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. > Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > 
{code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons > Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputC
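Liang-Chi's suggestion in the comment above, setting the flag through SQLConf rather than through a SparkConf properties file, can be sketched as follows. This is a minimal illustration assuming a Spark 2.x session; the flag name is taken from the comment, and whether it is honored depends on the Spark version in use:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("constraint-propagation-demo")
  .master("local[*]")
  .getOrCreate()

// SQL config flags go through the runtime SQL conf (SQLConf),
// not only through SparkConf / spark-defaults files:
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
```

The same effect is available from SQL as `SET spark.sql.constraintPropagation.enabled=false`.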
[jira] [Comment Edited] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960001#comment-15960001 ] Liang-Chi Hsieh edited comment on SPARK-20226 at 4/7/17 5:19 AM: - {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. It must be set through SQLConf. I am afraid that the configs in local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? was (Author: viirya): {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. It must be set through SQLConf. I am not sure if your local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? > Call to sqlContext.cacheTable takes an incredibly long time in some cases > - > > Key: SPARK-20226 > URL: https://issues.apache.org/jira/browse/SPARK-20226 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: linux or windows >Reporter: Barry Becker > Labels: cache > Attachments: profile_indexer2.PNG, xyzzy.csv > > > I have a case where the call to sqlContext.cacheTable can take an arbitrarily > long time depending on the number of columns that are referenced in a > withColumn expression applied to a dataframe. > The dataset is small (20 columns 7861 rows). The sequence to reproduce is the > following: > 1) add a new column that references 8 - 14 of the columns in the dataset. >- If I add 8 columns, then the call to cacheTable is fast - like *5 > seconds* >- If I add 11 columns, then it is slow - like *60 seconds* >- and if I add 14 columns, then it basically *takes forever* - I gave up > after 10 minutes or so. > The Column expression that is added, is basically just concatenating > the columns together in a single string. If a number is concatenated on a > string (or vice versa) the number is first converted to a string. 
> The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. > Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > 
{code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons > Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputCol":"Summ
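The workaround suggested in the comment above (setting the flag through SQLConf rather than a SparkConf properties file) can be sketched in application code. This is a hypothetical illustration, not the issue's resolution: the flag name is taken from the comment, and the app and table names are made up.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: set the SQL config flag through the runtime session conf (SQLConf),
// as the comment recommends, instead of a SparkConf file entry.
val spark = SparkSession.builder()
  .appName("spark-20226-workaround")  // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Toggle the flag before the expensive cacheTable call.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")

// "myTable" is a placeholder for the registered table from the report.
spark.sqlContext.cacheTable("myTable")
```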
[jira] [Issue Comment Deleted] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
[ https://issues.apache.org/jira/browse/SPARK-20247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-20247: Comment: was deleted (was: I will create a PR later.) > Add jar but this jar is missing later shouldn't affect jobs that doesn't use > this jar > - > > Key: SPARK-20247 > URL: https://issues.apache.org/jira/browse/SPARK-20247 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > The stack trace: > {noformat} > 17/04/06 17:24:46 ERROR Executor: Exception in task 25.1 in stage 3.0 (TID > 199) > java.lang.RuntimeException: Stream '/jars/test-1.0.0.jar' was not found. > at > org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:222) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) > at > 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > 17/04/06 17:24:46 INFO CoarseGrainedExecutorBackend: Got assigned task 200 > {noformat} > How to reproduce: > {code:java} > 
test("add jar but this jar is missing later") { > val tmpDir = Utils.createTempDir() > val tmpJar = File.createTempFile("test-1.0.0", ".jar", tmpDir) > sc = new SparkContext(new > SparkConf().setAppName("test").setMaster("local")) > sc.addJar(tmpJar.getAbsolutePath) > FileUtils.deleteQuietly(tmpJar) > assert (sc.parallelize(Array(1, 2, 3)).count === 3) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For ad
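The test above deletes the jar after registering it, so executors later fail to fetch a stream they never needed. Until the linked fix lands, a driver-side guard can at least fail fast when the jar is already gone at registration time. Below is a minimal sketch with a hypothetical helper name; note it does not close the race where the file disappears after addJar, which is the scenario the issue itself describes:

```scala
import java.io.File
import org.apache.spark.SparkContext

// Hypothetical helper: only call addJar when the file is still on disk,
// so a stale local path is caught on the driver instead of surfacing later
// as "Stream '/jars/...' was not found" on every executor.
def addJarIfPresent(sc: SparkContext, path: String): Boolean = {
  val jar = new File(path)
  val present = jar.isFile
  if (present) sc.addJar(path)
  present  // caller decides whether a missing jar is fatal
}
```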
[jira] [Assigned] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
[ https://issues.apache.org/jira/browse/SPARK-20247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20247: Assignee: Apache Spark
[jira] [Assigned] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
[ https://issues.apache.org/jira/browse/SPARK-20247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20247: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
[ https://issues.apache.org/jira/browse/SPARK-20247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960279#comment-15960279 ] Apache Spark commented on SPARK-20247: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/17558
[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20208: Assignee: (was: Apache Spark) > Document R fpGrowth support in vignettes, programming guide and code example > > > Key: SPARK-20208 > URL: https://issues.apache.org/jira/browse/SPARK-20208 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung
[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960278#comment-15960278 ] Apache Spark commented on SPARK-20208: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17557
[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20208: Assignee: Apache Spark
[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960277#comment-15960277 ] Maciej Szymkiewicz commented on SPARK-20208: [~felixcheung] I am working on this but it is still in WIP. I'll open a PR and ping when it is ready.
[jira] [Comment Edited] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960277#comment-15960277 ] Maciej Szymkiewicz edited comment on SPARK-20208 at 4/7/17 4:52 AM: [~felixcheung] I am working on this but it is still WIP. I'll open a PR and ping when it is ready. was (Author: zero323): [~felixcheung] I am working on this but it is still in WIP. I'll open a PR and ping when it is ready.
[jira] [Commented] (SPARK-20241) java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribut
[ https://issues.apache.org/jira/browse/SPARK-20241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960275#comment-15960275 ] Kazuaki Ishizaki commented on SPARK-20241: -- I realized that it is not easy to reproduce this error. Would it be possible to post a program to reproduce this issue? > java.lang.IllegalArgumentException: Can not set final [B field > org.codehaus.janino.util.ClassFile$CodeAttribute.code to > org.codehaus.janino.util.ClassFile$CodeAttribute > > > Key: SPARK-20241 > URL: https://issues.apache.org/jira/browse/SPARK-20241 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: 翟玉勇 >Priority: Minor > > Running a SQL operation, for example: insert into table mytable select * from > othertable, produces WARN logs like the following: > {code} > 17/04/06 17:04:55 WARN > [org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator(87) -- > main]: Error calculating stats of compiled class. > java.lang.IllegalArgumentException: Can not set final [B field > org.codehaus.janino.util.ClassFile$CodeAttribute.code to > org.codehaus.janino.util.ClassFile$CodeAttribute > at > sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:164) > at > sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:168) > at > sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:55) > at > sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) > at java.lang.reflect.Field.get(Field.java:379) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:997) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:984) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:984) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:983) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:983) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:979) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:979) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:951) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1023) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1020) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > 
org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) > at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) > at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:905) >
[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960267#comment-15960267 ] Felix Cheung commented on SPARK-20208: -- ping [~zero323]
[jira] [Updated] (SPARK-19825) spark.ml R API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19825: - Target Version/s: 2.2.0 > spark.ml R API for FPGrowth > --- > > Key: SPARK-19825 > URL: https://issues.apache.org/jira/browse/SPARK-19825 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0
[jira] [Resolved] (SPARK-19825) spark.ml R API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-19825. -- Resolution: Fixed Assignee: Maciej Szymkiewicz
[jira] [Updated] (SPARK-19825) spark.ml R API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19825: - Fix Version/s: 2.2.0
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960266#comment-15960266 ] Felix Cheung commented on SPARK-19827: -- yes > spark.ml R API for PIC > -- > > Key: SPARK-19827 > URL: https://issues.apache.org/jira/browse/SPARK-19827 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
[ https://issues.apache.org/jira/browse/SPARK-20247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960265#comment-15960265 ] Yuming Wang commented on SPARK-20247: - I will create a PR later. > Add jar but this jar is missing later shouldn't affect jobs that doesn't use > this jar > - > > Key: SPARK-20247 > URL: https://issues.apache.org/jira/browse/SPARK-20247 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > The stack trace: > {noformat} > 17/04/06 17:24:46 ERROR Executor: Exception in task 25.1 in stage 3.0 (TID > 199) > java.lang.RuntimeException: Stream '/jars/test-1.0.0.jar' was not found. > at > org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:222) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) > 
at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > 17/04/06 17:24:46 INFO CoarseGrainedExecutorBackend: Got assigned task 200 > {noformat} > How to reproduce: >
{code:java} > test("add jar but this jar is missing later") { > val tmpDir = Utils.createTempDir() > val tmpJar = File.createTempFile("test-1.0.0", ".jar", tmpDir) > sc = new SparkContext(new > SparkConf().setAppName("test").setMaster("local")) > sc.addJar(tmpJar.getAbsolutePath) > FileUtils.deleteQuietly(tmpJar) > assert (sc.parallelize(Array(1, 2, 3)).count === 3) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr..
[jira] [Created] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
Yuming Wang created SPARK-20247: --- Summary: Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar Key: SPARK-20247 URL: https://issues.apache.org/jira/browse/SPARK-20247 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0 Reporter: Yuming Wang
[jira] [Assigned] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16957: Assignee: Apache Spark > Use weighted midpoints for split values. > > > Key: SPARK-16957 > URL: https://issues.apache.org/jira/browse/SPARK-16957 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg >Assignee: Apache Spark >Priority: Trivial > > Just like R's gbm, we should be using weighted split points rather than the > actual continuous binned feature values. For instance, in a dataset > containing binary features (that are fed in as continuous ones), our splits > are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some > smoothness qualities, this is asymptotically bad compared to GBM's approach. > The split point should be a weighted split point of the two values of the > "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, > the above split should be at {{0.75}}. > Example: > {code} > +--------+--------+-----+-----+ > |feature0|feature1|label|count| > +--------+--------+-----+-----+ > | 0.0| 0.0| 0.0| 23| > | 1.0| 0.0| 0.0| 2| > | 0.0| 0.0| 1.0| 2| > | 0.0| 1.0| 0.0| 7| > | 1.0| 0.0| 1.0| 23| > | 0.0| 1.0| 1.0| 18| > | 1.0| 1.0| 1.0| 7| > | 1.0| 1.0| 0.0| 18| > +--------+--------+-----+-----+ > DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes > If (feature 0 <= 0.0) >If (feature 1 <= 0.0) > Predict: -0.56 >Else (feature 1 > 0.0) > Predict: 0.29333 > Else (feature 0 > 0.0) >If (feature 1 <= 0.0) > Predict: 0.56 >Else (feature 1 > 0.0) > Predict: -0.29333 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960260#comment-15960260 ] Apache Spark commented on SPARK-16957: -- User 'facaiy' has created a pull request for this issue: https://github.com/apache/spark/pull/17556
[jira] [Assigned] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16957: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960177#comment-15960177 ] Yan Facai (颜发才) commented on SPARK-16957: - I think it is helpful for small datasets, though the effect is negligible for large ones. The task is easy; the question is whether it is needed. If the issue is shepherded, I'd like to work on it.
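The weighted-split rule described in the ticket above can be sketched as a small helper. This is only an illustration of the formula implied by the ticket's example (30 samples at {{x = 0}} and 10 at {{x = 1}} giving a split at {{0.75}}), not Spark's internal splitting code:

```python
def weighted_split(v_left, n_left, v_right, n_right):
    """Weighted split point between two adjacent bin values.

    The split lies between v_left and v_right, displaced from v_left in
    proportion to the left bin's sample count, matching the ticket's
    example: 30 samples at x = 0 and 10 samples at x = 1 give a split
    at 0.75, instead of the plain midpoint 0.5 or the raw bin edge 0.0.
    """
    return (n_left * v_right + n_right * v_left) / (n_left + n_right)


# Ticket's example: 30 samples at 0.0, 10 samples at 1.0.
print(weighted_split(0.0, 30, 1.0, 10))  # 0.75
# Equal counts recover the ordinary midpoint.
print(weighted_split(0.0, 10, 1.0, 10))  # 0.5
```

When the two bins hold equal mass this reduces to the unweighted midpoint, so the change only matters where bin populations are skewed.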
[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-20240: -- Affects Version/s: (was: 2.1.0) > SparkSQL support limitations of max dynamic partitions when inserting hive > table > > > Key: SPARK-20240 > URL: https://issues.apache.org/jira/browse/SPARK-20240 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 1.6.3 >Reporter: zenglinxi > > We found that an HDFS problem sometimes occurs when a user has a typo in their > code while using SparkSQL to insert data into a partitioned table. > For example: > create table: > {quote} > create table test_tb ( >price double > ) PARTITIONED BY (day_partition string, hour_partition string) > {quote} > normal sql for inserting into the table: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select price, day_partition, hour_partition from other_table; > {quote} > sql with a typo: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select hour_partition, day_partition, price from other_table; > {quote} > This typo makes SparkSQL take the column "price" as "hour_partition", which may > create millions of HDFS files in a short time if "other_table" holds a large amount of data > with a wide range of "price" values, giving rise to awful NameNode > RPC performance. > We think it's a good idea to limit the maximum number of files allowed to be > created by each task, to protect the HDFS NameNode from unintentional errors. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
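The safeguard proposed above — capping the number of dynamic partitions a single insert may create — can be sketched outside Spark as a guard over the distinct partition values. The function name and limit below are illustrative only, not an actual Spark or Hive configuration:

```python
def check_dynamic_partitions(partition_values, max_partitions):
    """Fail fast if an insert would create too many dynamic partitions.

    With the typo in the example SQL above, the continuous `price`
    column is fed in as a partition column, so the number of distinct
    values (and hence HDFS directories and files) explodes. A guard
    like this would reject the statement before it hammers the NameNode.
    """
    distinct = set(partition_values)
    if len(distinct) > max_partitions:
        raise ValueError(
            f"insert would create {len(distinct)} dynamic partitions, "
            f"but the limit is {max_partitions}"
        )
    return len(distinct)


# A real partition column has few distinct values; this passes.
check_dynamic_partitions(["2017-04-06"] * 100, max_partitions=1000)
# A continuous column mistakenly used as a partition column fails fast.
try:
    check_dynamic_partitions([i * 0.01 for i in range(5000)], max_partitions=1000)
except ValueError as e:
    print(e)
```

Hive exposes a comparable knob ({{hive.exec.max.dynamic.partitions}}); the ticket asks for an equivalent check in SparkSQL's insert path.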
[jira] [Commented] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960147#comment-15960147 ] Gowtham commented on SPARK-13210: - we are facing same exception in spark 2.1.0 java.lang.NullPointerException at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:347) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:171) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:171) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.6.1, 2.0.0 > > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org
[jira] [Commented] (SPARK-19495) Make SQLConf slightly more extensible
[ https://issues.apache.org/jira/browse/SPARK-19495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960004#comment-15960004 ] Apache Spark commented on SPARK-19495: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17555 > Make SQLConf slightly more extensible > - > > Key: SPARK-19495 > URL: https://issues.apache.org/jira/browse/SPARK-19495 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.0 > > > SQLConf cannot be registered by external libraries right now due to the > visibility limitations. The ticket proposes making them slightly more > extensible by removing those visibility limitations. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960001#comment-15960001 ] Liang-Chi Hsieh edited comment on SPARK-20226 at 4/6/17 11:59 PM: -- {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. It must be set through SQLConf. I am not sure if your local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? was (Author: viirya): {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. I am not sure if your local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? > Call to sqlContext.cacheTable takes an incredibly long time in some cases > - > > Key: SPARK-20226 > URL: https://issues.apache.org/jira/browse/SPARK-20226 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: linux or windows >Reporter: Barry Becker > Labels: cache > Attachments: profile_indexer2.PNG, xyzzy.csv > > > I have a case where the call to sqlContext.cacheTable can take an arbitrarily > long time depending on the number of columns that are referenced in a > withColumn expression applied to a dataframe. > The dataset is small (20 columns 7861 rows). The sequence to reproduce is the > following: > 1) add a new column that references 8 - 14 of the columns in the dataset. >- If I add 8 columns, then the call to cacheTable is fast - like *5 > seconds* >- If I add 11 columns, then it is slow - like *60 seconds* >- and if I add 14 columns, then it basically *takes forever* - I gave up > after 10 minutes or so. > The Column expression that is added, is basically just concatenating > the columns together in a single string. If a number is concatenated on a > string (or vice versa) the number is first converted to a string. 
> The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. > Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > 
{code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons > Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputCol":"Summons > Number_CLEANED__" >} >
[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960001#comment-15960001 ] Liang-Chi Hsieh commented on SPARK-20226: - {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. I am not sure if your local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? > Call to sqlContext.cacheTable takes an incredibly long time in some cases > - > > Key: SPARK-20226 > URL: https://issues.apache.org/jira/browse/SPARK-20226 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: linux or windows >Reporter: Barry Becker > Labels: cache > Attachments: profile_indexer2.PNG, xyzzy.csv > > > I have a case where the call to sqlContext.cacheTable can take an arbitrarily > long time depending on the number of columns that are referenced in a > withColumn expression applied to a dataframe. > The dataset is small (20 columns 7861 rows). The sequence to reproduce is the > following: > 1) add a new column that references 8 - 14 of the columns in the dataset. >- If I add 8 columns, then the call to cacheTable is fast - like *5 > seconds* >- If I add 11 columns, then it is slow - like *60 seconds* >- and if I add 14 columns, then it basically *takes forever* - I gave up > after 10 minutes or so. > The Column expression that is added, is basically just concatenating > the columns together in a single string. If a number is concatenated on a > string (or vice versa) the number is first converted to a string. 
> The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. > Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > 
{code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons > Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputCol":"Summons > Number_CLEANED__" >} >}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333202079,"sparkVersion":"2.1.0", > "uid":"bucketizer_f5db4fb8120e", > "paramMap":{ > > "splits":["-Inf",1.435215616E9,1.443855616E9,1.447271936E9,1.448222464E9,1.448395264E9,1.448481536E9,1.448827136E9,1.449259264E9,1
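The flag Liang-Chi points at is a SQL-level setting scoped to SQLConf, so setting it only in a cluster-side conf file or SparkConf may not take effect for the session. A minimal sketch of setting it in-session — whether disabling constraint propagation actually resolves this slowdown is exactly what the comment asks to verify:

```sql
-- Session-level SQLConf setting (e.g. via beeline against the Thrift server);
-- hypothetical workaround to test, per the comment above.
SET spark.sql.constraintPropagation.enabled=false;
```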
[jira] [Updated] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-20246: --- Labels: correctness (was: ) > Should check determinism when pushing predicates down through aggregation > - > > Key: SPARK-20246 > URL: https://issues.apache.org/jira/browse/SPARK-20246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weiluo Ren > Labels: correctness > > {code}import org.apache.spark.sql.functions._ > spark.range(1,1000).distinct.withColumn("random", > rand()).filter(col("random") > 0.3).orderBy("random").show{code} > gives wrong result. > In the optimized logical plan, it shows that the filter with the > non-deterministic predicate is pushed beneath the aggregate operator, which > should not happen. > cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959946#comment-15959946 ] Cheng Lian commented on SPARK-20246: [This line|https://github.com/apache/spark/blob/a4491626ed8169f0162a0dfb78736c9b9e7fb434/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L795] should be the root cause. We didn't check determinism of the predicates before pushing them down. The same thing also applies when pushing predicates through union and window operators. cc [~cloud_fan] > Should check determinism when pushing predicates down through aggregation > - > > Key: SPARK-20246 > URL: https://issues.apache.org/jira/browse/SPARK-20246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weiluo Ren > > {code}import org.apache.spark.sql.functions._ > spark.range(1,1000).distinct.withColumn("random", > rand()).filter(col("random") > 0.3).orderBy("random").show{code} > gives wrong result. > In the optimized logical plan, it shows that the filter with the > non-deterministic predicate is pushed beneath the aggregate operator, which > should not happen. > cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
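The fix Cheng Lian describes — checking determinism before pushing predicates through an aggregate — can be illustrated outside Catalyst. This is a toy model under assumed names (`Predicate`, `pushable` are invented, not Spark's actual Optimizer classes): a predicate like `rand() > 0.3` is non-deterministic and must stay above the aggregate, while `id > 10` may be pushed below it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the optimizer decision: only deterministic predicates may be
// pushed below an aggregate. Evaluating rand() > 0.3 before the distinct
// would change which rows survive, which is the bug in the report above.
class PushdownDemo {
    static class Predicate {
        final String expr;
        final boolean deterministic;
        Predicate(String expr, boolean deterministic) {
            this.expr = expr;
            this.deterministic = deterministic;
        }
    }

    // Returns only the predicates that are safe to push through the aggregate.
    static List<Predicate> pushable(List<Predicate> predicates) {
        List<Predicate> result = new ArrayList<>();
        for (Predicate p : predicates) {
            if (p.deterministic) {
                result.add(p);
            }
        }
        return result;
    }
}
```

The same guard applies when pushing through union and window operators, as the comment notes.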
[jira] [Updated] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiluo Ren updated SPARK-20246: --- Description: `import org.apache.spark.sql.functions._ spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] was: `import org.apache.spark.sql.functions._` `spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] > Should check determinism when pushing predicates down through aggregation > - > > Key: SPARK-20246 > URL: https://issues.apache.org/jira/browse/SPARK-20246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weiluo Ren > > `import org.apache.spark.sql.functions._ > spark.range(1,1000).distinct.withColumn("random", > rand()).filter(col("random") > 0.3).orderBy("random").show` > gives wrong result. > In the optimized logical plan, it shows that the filter with the > non-deterministic predicate is pushed beneath the aggregate operator, which > should not happen. > cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiluo Ren updated SPARK-20246: --- Description: {code}import org.apache.spark.sql.functions._ spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show{code} gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] was: `import org.apache.spark.sql.functions._ spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] > Should check determinism when pushing predicates down through aggregation > - > > Key: SPARK-20246 > URL: https://issues.apache.org/jira/browse/SPARK-20246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weiluo Ren > > {code}import org.apache.spark.sql.functions._ > spark.range(1,1000).distinct.withColumn("random", > rand()).filter(col("random") > 0.3).orderBy("random").show{code} > gives wrong result. > In the optimized logical plan, it shows that the filter with the > non-deterministic predicate is pushed beneath the aggregate operator, which > should not happen. > cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiluo Ren updated SPARK-20246: --- Description: `import org.apache.spark.sql.functions._` `spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] was: ` import org.apache.spark.sql.functions._ spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show ` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] > Should check determinism when pushing predicates down through aggregation > - > > Key: SPARK-20246 > URL: https://issues.apache.org/jira/browse/SPARK-20246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weiluo Ren > > `import org.apache.spark.sql.functions._` > `spark.range(1,1000).distinct.withColumn("random", > rand()).filter(col("random") > 0.3).orderBy("random").show` > gives wrong result. > In the optimized logical plan, it shows that the filter with the > non-deterministic predicate is pushed beneath the aggregate operator, which > should not happen. > cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
Weiluo Ren created SPARK-20246: -- Summary: Should check determinism when pushing predicates down through aggregation Key: SPARK-20246 URL: https://issues.apache.org/jira/browse/SPARK-20246 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Weiluo Ren ` import org.apache.spark.sql.functions._ spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show ` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race
[ https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Raducanu updated SPARK-20243: Description: Introduced by SPARK-19946. DebugFilesystem.assertNoOpenStreams gets the size of the openStreams ConcurrentHashMap and then later, if the size was > 0, accesses the first element in openStreams.values. But, the ConcurrentHashMap might be cleared by another thread between getting its size and accessing it, resulting in an exception when trying to call .head on an empty collection. was:Introduced by SPARK-19946 > DebugFilesystem.assertNoOpenStreams thread race > --- > > Key: SPARK-20243 > URL: https://issues.apache.org/jira/browse/SPARK-20243 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu > > Introduced by SPARK-19946. > DebugFilesystem.assertNoOpenStreams gets the size of the openStreams > ConcurrentHashMap and then later, if the size was > 0, accesses the first > element in openStreams.values. But, the ConcurrentHashMap might be cleared by > another thread between getting its size and accessing it, resulting in an > exception when trying to call .head on an empty collection.
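The race in the description — reading the map's size, then accessing a value while another thread may `clear()` the map in between — has a standard repair: take one snapshot of the values and work only with the snapshot. A plain-Java sketch; the class and method names are stand-ins, not Spark's actual `DebugFilesystem`:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the fixed assertion. The racy version did:
//   if (openStreams.size() > 0) { ... openStreams.values().head ... }
// which can fail if the map is cleared between the two calls. Copying the
// values once gives a single consistent view to both check and use.
class OpenStreamsCheck {
    static void assertNoOpenStreamsFixed(ConcurrentHashMap<Long, Throwable> openStreams)
            throws IOException {
        // One snapshot -- safe even if another thread clears the map now.
        List<Throwable> snapshot = new ArrayList<>(openStreams.values());
        if (!snapshot.isEmpty()) {
            throw new IOException(
                "There are " + snapshot.size() + " possibly leaked streams", snapshot.get(0));
        }
    }
}
```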
[jira] [Assigned] (SPARK-20026) Document R GLM Tweedie family support in programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20026: Assignee: (was: Apache Spark) > Document R GLM Tweedie family support in programming guide and code example > --- > > Key: SPARK-20026 > URL: https://issues.apache.org/jira/browse/SPARK-20026 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >
[jira] [Assigned] (SPARK-20026) Document R GLM Tweedie family support in programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20026: Assignee: Apache Spark > Document R GLM Tweedie family support in programming guide and code example > --- > > Key: SPARK-20026 > URL: https://issues.apache.org/jira/browse/SPARK-20026 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Apache Spark >
[jira] [Commented] (SPARK-20026) Document R GLM Tweedie family support in programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959527#comment-15959527 ] Apache Spark commented on SPARK-20026: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/17553 > Document R GLM Tweedie family support in programming guide and code example > --- > > Key: SPARK-20026 > URL: https://issues.apache.org/jira/browse/SPARK-20026 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >
[jira] [Commented] (SPARK-20241) java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribut
[ https://issues.apache.org/jira/browse/SPARK-20241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959495#comment-15959495 ] Kazuaki Ishizaki commented on SPARK-20241: -- I think this problem is caused by using different class loaders for {{org.codehaus.janino.util.ClassFile$CodeAttribute}} and {{org.codehaus.janino.util.ClassFile}}. I have created a patch that uses the same class loader, and am now writing a test case. > java.lang.IllegalArgumentException: Can not set final [B field > org.codehaus.janino.util.ClassFile$CodeAttribute.code to > org.codehaus.janino.util.ClassFile$CodeAttribute > > > Key: SPARK-20241 > URL: https://issues.apache.org/jira/browse/SPARK-20241 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: 翟玉勇 >Priority: Minor > > Running a SQL operation, for example {{insert into table mytable select * from > othertable}}, produces a WARN log: > {code} > 17/04/06 17:04:55 WARN > [org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator(87) -- > main]: Error calculating stats of compiled class. 
> java.lang.IllegalArgumentException: Can not set final [B field > org.codehaus.janino.util.ClassFile$CodeAttribute.code to > org.codehaus.janino.util.ClassFile$CodeAttribute > at > sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:164) > at > sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:168) > at > sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:55) > at > sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) > at java.lang.reflect.Field.get(Field.java:379) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:997) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:984) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:984) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:983) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:983) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:979) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:979) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:951) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1023) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1020) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) > at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) > at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalC
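The comment above attributes the exception to two class loaders each loading `ClassFile$CodeAttribute`: reflection requires the field's declaring class and the receiver's class to be the *same* `Class` object, and two loaders produce two distinct `Class` objects for the same name. A minimal sketch of that receiver check firing — the stand-in classes below simulate the mismatch without a second class loader, and the exact "Can not set final [B field" wording in the trace is a quirk of the JDK's internal accessor reusing the setter's message:

```java
import java.lang.reflect.Field;

// Two distinct classes stand in for "the same class loaded by two loaders".
class CodeAttributeA { byte[] code = new byte[] {1}; }
class CodeAttributeB { byte[] code = new byte[] {2}; }

class ReflectionMismatchDemo {
    // Field.get rejects a receiver whose class is not the field's declaring
    // class -- the same identity check that fails across class loaders.
    static boolean getRejectsForeignReceiver() throws Exception {
        Field f = CodeAttributeA.class.getDeclaredField("code");
        f.setAccessible(true);
        try {
            f.get(new CodeAttributeB()); // receiver is not a CodeAttributeA
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }
}
```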
[jira] [Assigned] (SPARK-17019) Expose off-heap memory usage in various places
[ https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-17019: Assignee: Saisai Shao > Expose off-heap memory usage in various places > -- > > Key: SPARK-17019 > URL: https://issues.apache.org/jira/browse/SPARK-17019 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.2.0 > > > With SPARK-13992, Spark supports persisting data into off-heap memory, but > the usage of off-heap is not exposed currently, it is not so convenient for > user to monitor and profile, so here propose to expose off-heap memory as > well as on-heap memory usage in various places: > 1. Spark UI's executor page will display both on-heap and off-heap memory > usage. > 2. REST request returns both on-heap and off-heap memory. > 3. Also these two memory usage can be obtained programmatically from > SparkListener. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17019) Expose off-heap memory usage in various places
[ https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-17019. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 14617 [https://github.com/apache/spark/pull/14617] > Expose off-heap memory usage in various places > -- > > Key: SPARK-17019 > URL: https://issues.apache.org/jira/browse/SPARK-17019 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Priority: Minor > Fix For: 2.2.0 > > > With SPARK-13992, Spark supports persisting data into off-heap memory, but > the usage of off-heap is not exposed currently, it is not so convenient for > user to monitor and profile, so here propose to expose off-heap memory as > well as on-heap memory usage in various places: > 1. Spark UI's executor page will display both on-heap and off-heap memory > usage. > 2. REST request returns both on-heap and off-heap memory. > 3. Also these two memory usage can be obtained programmatically from > SparkListener. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959368#comment-15959368 ] yuhao yang commented on SPARK-20082: Sorry I'm occupied by some internal project this week. I'll find some time to look into it this weekend or early next week. > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA.
[jira] [Commented] (SPARK-20060) Support Standalone visiting secured HDFS
[ https://issues.apache.org/jira/browse/SPARK-20060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959349#comment-15959349 ] Marcelo Vanzin commented on SPARK-20060: Adding a link to SPARK-5158; they're not exactly duplicates, but maybe they should be. In both cases, a spec should be written explaining what is being done, with an explanation of how it is secure. > Support Standalone visiting secured HDFS > - > > Key: SPARK-20060 > URL: https://issues.apache.org/jira/browse/SPARK-20060 > Project: Spark > Issue Type: New Feature > Components: Deploy, Spark Core >Affects Versions: 2.2.0 >Reporter: Kent Yao > > h1. Brief design > h2. Introductions > The basic issue for Standalone mode when visiting Kerberos-secured HDFS or other > kerberized services is how to gather the delegation tokens on the driver side > and deliver them to the executor side. > When we run Spark on YARN, the tokens are set on the container launch context > and delivered automatically; the long-running-application problem caused by token > expiration was fixed by SPARK-14743, which writes the tokens to HDFS and keeps > updating and renewing the credential file. > When running Spark on Standalone, we currently have no mechanism like YARN's > to obtain and deliver those tokens. > h2. Implementations > First, we move the implementation of SPARK-14743, which is currently > YARN-only, into the core module. We use it to gather the credentials we need, and > also to update and renew the credential files on HDFS. > Second, credential files on secured HDFS are not reachable by executors before > they obtain the tokens. Here we add a sequence configuration > `spark.deploy.credential.entities`, which the driver populates with > `token.encodeToUrlString()` before launching the executors; the executors fetch this > credential string sequence while fetching the driver-side Spark properties, and > then decode the strings back into tokens. 
> Before setting up the `CoarseGrainedExecutorBackend`, we set the credentials > on the current executor-side UGI.
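The delivery path the proposal describes — driver serializes tokens into a URL-safe string under a config key, executors read that key back while fetching the driver-side properties and decode it — can be sketched with a plain Base64 round-trip. In Spark this would use Hadoop's `Token.encodeToUrlString()` / `Token.decodeFromUrlString()`; the class and method names below are hypothetical illustrations of the mechanism, not the patch's actual code:

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Sketch of credential delivery via a property entry. Hadoop's
// Token.encodeToUrlString() does URL-safe Base64 of the token's bytes;
// plain Base64 here just illustrates the round trip.
class CredentialDelivery {
    // Config key named in the proposal above.
    static final String KEY = "spark.deploy.credential.entities";

    // Driver side: encode token bytes into the properties handed to executors.
    static Map<String, String> driverSideEncode(byte[] tokenBytes) {
        Map<String, String> props = new HashMap<>();
        props.put(KEY, Base64.getUrlEncoder().encodeToString(tokenBytes));
        return props;
    }

    // Executor side: recover the token bytes before setting up the backend.
    static byte[] executorSideDecode(Map<String, String> props) {
        return Base64.getUrlDecoder().decode(props.get(KEY));
    }
}
```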
[jira] [Assigned] (SPARK-20245) pass output to LogicalRelation directly
[ https://issues.apache.org/jira/browse/SPARK-20245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20245: Assignee: Wenchen Fan (was: Apache Spark) > pass output to LogicalRelation directly > --- > > Key: SPARK-20245 > URL: https://issues.apache.org/jira/browse/SPARK-20245 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Assigned] (SPARK-20245) pass output to LogicalRelation directly
[ https://issues.apache.org/jira/browse/SPARK-20245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20245: Assignee: Apache Spark (was: Wenchen Fan) > pass output to LogicalRelation directly > --- > > Key: SPARK-20245 > URL: https://issues.apache.org/jira/browse/SPARK-20245 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >
[jira] [Commented] (SPARK-20245) pass output to LogicalRelation directly
[ https://issues.apache.org/jira/browse/SPARK-20245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959333#comment-15959333 ] Apache Spark commented on SPARK-20245: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17552 > pass output to LogicalRelation directly > --- > > Key: SPARK-20245 > URL: https://issues.apache.org/jira/browse/SPARK-20245 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Created] (SPARK-20245) pass output to LogicalRelation directly
Wenchen Fan created SPARK-20245: --- Summary: pass output to LogicalRelation directly Key: SPARK-20245 URL: https://issues.apache.org/jira/browse/SPARK-20245 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Resolved] (SPARK-20195) SparkR to add createTable catalog API and deprecate createExternalTable
[ https://issues.apache.org/jira/browse/SPARK-20195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-20195. -- Resolution: Fixed Fix Version/s: 2.2.0 Target Version/s: 2.2.0 > SparkR to add createTable catalog API and deprecate createExternalTable > --- > > Key: SPARK-20195 > URL: https://issues.apache.org/jira/browse/SPARK-20195 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.2.0 > > > Only naming differences for clarity, functionality is already supported with > and without path parameter.
[jira] [Resolved] (SPARK-20196) Python to add catalog API for refreshByPath
[ https://issues.apache.org/jira/browse/SPARK-20196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-20196. -- Resolution: Fixed Fix Version/s: 2.2.0 Target Version/s: 2.2.0 > Python to add catalog API for refreshByPath > --- > > Key: SPARK-20196 > URL: https://issues.apache.org/jira/browse/SPARK-20196 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.2.0 > >
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959185#comment-15959185 ] Steven Ruppert commented on SPARK-19870: ok, I can get that. trace logging as in slf4j TRACE level on org.apache.spark, or something else? I suppose it's worth noting that I've seen my jobs hit the "Dropping SparkListenerEvent because no remaining room in event queue. " on the master, though I don't think I saw it at the same time as this deadlock. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > > > Key: SPARK-19870 > URL: https://issues.apache.org/jira/browse/SPARK-19870 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 2.0.2, 2.1.0 > Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, > yarn coarse-grained. >Reporter: Steven Ruppert > Attachments: stack.txt > > > Running what I believe to be a fairly vanilla spark job, using the RDD api, > with several shuffles, a cached RDD, and finally a conversion to DataFrame to > save to parquet. I get a repeatable deadlock at the very last reducers of one > of the stages. 
> Roughly: > {noformat} > "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 > tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry > [0x7fffb95f3000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207) > - waiting to lock <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x0005b12f2290> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > and > {noformat} > "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 > tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at > org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202) > - locked <0x000545736b58> (a > org.apache.spark.storage.BlockInfoManager) > at > 
org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210) > - locked <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x00059711eb10> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent
[jira] [Commented] (SPARK-20228) Random Forest unstable results depending on spark.executor.memory
[ https://issues.apache.org/jira/browse/SPARK-20228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959168#comment-15959168 ] Ansgar Schulze commented on SPARK-20228: I just varied spark.executor.memory; everything else is constant. And I performed 5 runs with each configuration. > Random Forest unstable results depending on spark.executor.memory > - > > Key: SPARK-20228 > URL: https://issues.apache.org/jira/browse/SPARK-20228 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Ansgar Schulze > > If I deploy a random forest model with, for example, > spark.executor.memory=20480M > I get a different result than if I deploy the model with > spark.executor.memory=6000M > I expected the same results but different runtimes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
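One plausible explanation, assuming the memory change alters how the data ends up partitioned: distributed samplers commonly draw a separate pseudo-random stream per partition, so the same seed over differently partitioned data yields different subsamples. A minimal, hypothetical pure-Python sketch of that effect (not Spark's actual sampler):

```python
import random

def sample_per_partition(data, n_parts, fraction=0.5, seed=42):
    # Each partition gets its own RNG stream (seed + partition id), as
    # distributed samplers commonly do, so which rows are sampled depends
    # on how rows are split across partitions -- not just on the seed.
    parts = [data[i::n_parts] for i in range(n_parts)]
    chosen = []
    for pid, part in enumerate(parts):
        rng = random.Random(seed + pid)
        chosen.extend(x for x in part if rng.random() < fraction)
    return sorted(chosen)

data = list(range(40))
a = sample_per_partition(data, n_parts=2)
b = sample_per_partition(data, n_parts=4)
# Same seed, same data -- a different partition count almost surely
# selects a different subsample, which would change a trained forest.
```

Under that assumption, identical results across executor-memory settings would require identical partitioning, not just an identical seed.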
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959169#comment-15959169 ] Eyal Farago commented on SPARK-19870: - Steven, I meant traces produced by logging. Sean, this looks like a missed notification: one thread holds a lock and waits (via Java's synchronized/wait/notify mechanism) for a notification and condition that never arrive, while a second thread attempts to synchronize on the same object. This code is annotated with lots of traces that can shed some light on the issue. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > [quoted issue description and stack traces snipped; identical to the first occurrence above]
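The missed-notification failure mode described in the comment above can be sketched without any Spark code: if notify() fires before the waiter calls wait(), and the waiter does not re-check its predicate in a loop, the wakeup is lost. A minimal, hypothetical Python illustration, using threading.Condition in place of Java's monitor:

```python
import threading

cond = threading.Condition()
ready = False  # the predicate the waiter is supposed to check

def notifier():
    global ready
    with cond:
        ready = True
        cond.notify()  # no thread is waiting yet, so this wakeup is simply lost

def broken_wait():
    # BUG: waits unconditionally instead of looping on the predicate.
    # The timeout exists only so this demo terminates; the real bug hangs forever.
    with cond:
        return cond.wait(timeout=0.2)  # returns False when it times out

def correct_wait():
    # Re-checks the predicate, so a notify() that fired earlier is harmless.
    with cond:
        while not ready:
            cond.wait(timeout=0.2)
        return True

notifier()              # the notification fires first...
missed = broken_wait()  # ...so the unconditional wait just times out (False)
ok = correct_wait()     # the predicate-checking wait proceeds immediately (True)
```

This is why Java's Object.wait documentation insists on waiting inside a loop guarded by the condition being waited for.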
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959167#comment-15959167 ] Steven Ruppert commented on SPARK-19870: It didn't seem to be a JVM-level deadlock; rather, the machinery in BlockInfoManager was such that it was impossible to make forward progress. I only vaguely recall looking into this, but it was something with readLocksByTask and writeLocksByTask not lining up. It's also possible that it's actually some sort of cluster-level deadlock, because I could kill an executor that was stuck in this condition, wait for the master to relaunch it (on YARN), and the executor would get stuck in what looked like the same place. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > [quoted issue description and stack traces snipped; identical to the first occurrence above]
[jira] [Commented] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race
[ https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959138#comment-15959138 ] Sean Owen commented on SPARK-20243: --- This needs detail to be a JIRA. > DebugFilesystem.assertNoOpenStreams thread race > --- > > Key: SPARK-20243 > URL: https://issues.apache.org/jira/browse/SPARK-20243 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu > > Introduced by SPARK-19946 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
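For context on why an assertNoOpenStreams-style check can race: the assertion reads a shared open-stream registry while another thread may not yet have finished closing its stream, so the check can fail spuriously even though every stream does eventually get closed. A tiny, hypothetical sketch of that check-then-act window (plain Python, not the actual DebugFilesystem code):

```python
import threading

open_streams = set()     # shared registry of streams not yet closed
lock = threading.Lock()

def open_stream(name):
    with lock:
        open_streams.add(name)

def close_stream(name):
    with lock:
        open_streams.discard(name)

def leaked_streams():
    # Snapshot of the registry; only meaningful once all workers have
    # finished closing -- checking it any earlier is the race.
    with lock:
        return set(open_streams)

open_stream("stream-1")
leaked_early = leaked_streams()   # another thread hasn't closed yet: non-empty
close_stream("stream-1")
leaked_late = leaked_streams()    # after the close completes: empty
```

The two snapshots differ only in *when* the check runs, which is exactly the kind of detail the report would need to spell out.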
[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959134#comment-15959134 ] Barry Becker commented on SPARK-20226: -- Yes. We are running through spark job-server, and local.conf is where all the spark options go (in a section called context-settings {...}). > Call to sqlContext.cacheTable takes an incredibly long time in some cases > - > > Key: SPARK-20226 > URL: https://issues.apache.org/jira/browse/SPARK-20226 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: linux or windows >Reporter: Barry Becker > Labels: cache > Attachments: profile_indexer2.PNG, xyzzy.csv > > > I have a case where the call to sqlContext.cacheTable can take an arbitrarily > long time depending on the number of columns that are referenced in a > withColumn expression applied to a dataframe. > The dataset is small (20 columns, 7861 rows). The sequence to reproduce is the > following: > 1) add a new column that references 8 to 14 of the columns in the dataset. >- If I add 8 columns, then the call to cacheTable is fast - like *5 > seconds* >- If I add 11 columns, then it is slow - like *60 seconds* >- and if I add 14 columns, then it basically *takes forever* - I gave up > after 10 minutes or so. > The Column expression that is added is basically just the concatenation of > the columns into a single string. If a number is concatenated with a > string (or vice versa), the number is first converted to a string. 
> The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. > Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > 
{code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons > Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputCol":"Summons > Number_CLEANED__" >} >}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333202079,"sparkVersion":"2.1.0", > "uid":"bucketizer_f5db4fb8120e", > "paramMap":{ > > "splits":["-Inf",1.435215616E9,1.443855616E9,1.447271936E9,1.448222464E9,1.448395264E9,1.448481536E9,1.448827136E9,1.449259264E9,1.449432064E9,1.449518336E9,"Inf"], > "handleInvalid":"keep","outputCol":
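The super-linear slowdown is consistent with the shape of the expression quoted above: each added column wraps the entire previous expression in another UDF call, so the tree's depth grows with the column count and the analyzer revisits ever-deeper subtrees. A hypothetical sketch of how that nesting accumulates (plain Python, not Catalyst):

```python
def nested_udf_expr(cols):
    # Mirrors the reported expression: UDF(UDF(...UDF(c1, c2)..., cN-1), cN),
    # where every new column wraps the whole expression built so far.
    expr = cols[0]
    for c in cols[1:]:
        expr = ("UDF", expr, c)
    return expr

def depth(expr):
    # Depth of the resulting expression tree: leaves (column names) count as 1.
    if not isinstance(expr, tuple):
        return 1
    return 1 + max(depth(child) for child in expr[1:])

cols8 = [f"c{i}" for i in range(8)]
cols14 = [f"c{i}" for i in range(14)]
# Depth grows linearly with the column count: 8 columns -> depth 8,
# 14 columns -> depth 14, and per-node analysis cost compounds with it.
```

By contrast, a flat n-ary combination (e.g. Spark SQL's built-in concat over all columns at once) keeps the tree depth constant regardless of how many columns are involved.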
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959131#comment-15959131 ] Sean Owen commented on SPARK-19870: --- That isn't a deadlock, in that only one thread is waiting on a lock. At least, this trace doesn't show one. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > [quoted issue description and stack traces snipped; identical to the first occurrence above]
[jira] [Commented] (SPARK-8696) Streaming API for Online LDA
[ https://issues.apache.org/jira/browse/SPARK-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959088#comment-15959088 ] Rene Richard commented on SPARK-8696: - Hello, we'd like to use Online LDA for something like change-point detection. We need access to the intermediate topic lists after each new batch is processed, so that we can see how the topics change over time. As far as I understand it, the current implementation of OnlineLDA in MLlib doesn't expose intermediate topic lists per mini-batch. Will the predictOn method give us access to topics as they evolve with new data? I am relatively new to Spark, but I find that having two APIs (spark.ml and spark.mllib) is a bit confusing. Will these be merged in the future? > Streaming API for Online LDA > > > Key: SPARK-8696 > URL: https://issues.apache.org/jira/browse/SPARK-8696 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Streaming LDA can be a natural extension of online LDA. > Yet for now we need to settle the implementation of LDA prediction, to > support the predictOn method in the streaming version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959072#comment-15959072 ] Liang-Chi Hsieh commented on SPARK-20226: - I am not sure what the job-server local.conf is. Does it really affect the Spark SQL conf? > Call to sqlContext.cacheTable takes an incredibly long time in some cases > [quoted issue description snipped; identical to the first occurrence above]
[jira] [Updated] (SPARK-20244) Incorrect input size in UI with pyspark
[ https://issues.apache.org/jira/browse/SPARK-20244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artur Sukhenko updated SPARK-20244: --- Attachment: pyspark_incorrect_inputsize.png sparkshell_correct_inputsize.png Spark-shell and pyspark UI screenshots. > Incorrect input size in UI with pyspark > --- > > Key: SPARK-20244 > URL: https://issues.apache.org/jira/browse/SPARK-20244 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.1.0 >Reporter: Artur Sukhenko >Priority: Minor > Attachments: pyspark_incorrect_inputsize.png, > sparkshell_correct_inputsize.png > > > In Spark UI (Details for Stage) Input Size is 64.0 KB when running in > PySparkShell. > Also it is incorrect in Tasks table: > 64.0 KB / 132120575 in pyspark > 252.0 MB / 132120575 in spark-shell > I will attach screenshots. > Reproduce steps: > Run this to generate big file (press Ctrl+C after 5-6 seconds) > $ yes > /tmp/yes.txt > $ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/ > $ ./bin/pyspark > {code} > Python 2.7.5 (default, Nov 6 2016, 00:28:07) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.5 (default, Nov 6 2016 00:28:07) > SparkSession available as 'spark'.{code} > >>> a = sc.textFile("/tmp/yes.txt") > >>> a.count() > Open Spark UI and check Stage 0. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20244) Incorrect input size in UI with pyspark
Artur Sukhenko created SPARK-20244: -- Summary: Incorrect input size in UI with pyspark Key: SPARK-20244 URL: https://issues.apache.org/jira/browse/SPARK-20244 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0, 2.0.0 Reporter: Artur Sukhenko Priority: Minor [description snipped; identical to the quoted copy in the update notice above] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959053#comment-15959053 ] Steven Ruppert commented on SPARK-19870: Should be attached as https://issues.apache.org/jira/secure/attachment/12856834/stack.txt . If you have any scenarios or configurations I can try, let me know. Otherwise I'll see if I can reduce this to a minimal test case. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > [quoted issue description and stack traces snipped; identical to the first occurrence above]
[jira] [Commented] (SPARK-19900) [Standalone] Master registers application again when driver relaunched
[ https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959052#comment-15959052 ] Sergey commented on SPARK-19900: This issue is not reproduced in Spark 2.1.0. > [Standalone] Master registers application again when driver relaunched > -- > > Key: SPARK-19900 > URL: https://issues.apache.org/jira/browse/SPARK-19900 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.6.2 > Environment: Centos 6.5, spark standalone >Reporter: Sergey >Priority: Critical > Labels: Spark, network, standalone, supervise > > I've found some problems when the node where the driver is running has an > unstable network. A situation is possible in which two identical applications are > running on the cluster. > *Steps to Reproduce:* > # prepare 3 nodes: one for the Spark master and two for the Spark workers > # submit an application with the parameter spark.driver.supervise = true > # go to the node where the driver is running (for example spark-worker-1) and > close port 7077 > {code} > # iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # wait more than 60 seconds > # look at the Spark master UI > There are two Spark applications and one driver. The new application is in the > WAITING state and the second application is in the RUNNING state. The driver is in > the RUNNING or RELAUNCHING state (it depends on the resources available, as I > understand it) and is launched on another node (for example spark-worker-2) > # open the port > {code} > # iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # look at the Spark UI again > There are no changes > In addition, if you look at the processes on node spark-worker-1 > {code} > # ps ax | grep spark > {code} > you will see that the old driver is still running! 
> *Spark master logs:* > {code} > 17/03/10 05:26:27 WARN Master: Removing > worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 > seconds > 17/03/10 05:26:27 INFO Master: Removing worker > worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039 > 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1 > 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0 > 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347- > 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on > worker worker-20170310052411-spark-worker-2-40473 > 17/03/10 05:26:35 INFO Master: Registering app TestApplication > 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID > app-20170310052635-0001 > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got status update for unknown executor > app-20170310052354-/1 > 17/03/10 05:31:07 WARN Master: Got status update for unknown executor > app-20170310052354-/0 > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. 
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-201703100522
[jira] [Created] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race
Bogdan Raducanu created SPARK-20243: --- Summary: DebugFilesystem.assertNoOpenStreams thread race Key: SPARK-20243 URL: https://issues.apache.org/jira/browse/SPARK-20243 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.2.0 Reporter: Bogdan Raducanu Introduced by SPARK-19946 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
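The report is terse, but an assertion like `assertNoOpenStreams` is prone to a check-then-act race if one thread inspects the collection of tracked streams while another thread is still opening or closing one. A minimal sketch of the safe pattern (illustrative only; the class and method names are hypothetical, not Spark's actual implementation):

```python
import threading

class StreamTracker:
    """Illustrative stand-in for a debug filesystem that tracks open streams."""

    def __init__(self):
        self._lock = threading.Lock()
        self._open = {}               # stream id -> creation site

    def opened(self, sid, site):
        with self._lock:
            self._open[sid] = site

    def closed(self, sid):
        with self._lock:
            self._open.pop(sid, None)

    def assert_no_open_streams(self):
        # Snapshot under the lock so a concurrent open/close cannot mutate
        # the collection while it is being inspected (the likely race here).
        with self._lock:
            leaked = dict(self._open)
        assert not leaked, f"open streams: {leaked}"

tracker = StreamTracker()
tracker.opened(1, "test A")
tracker.closed(1)
tracker.assert_no_open_streams()   # passes: everything was closed
```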
[jira] [Comment Edited] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959024#comment-15959024 ] Barry Becker edited comment on SPARK-20226 at 4/6/17 2:45 PM: -- I set spark.sql.constraintPropagation.enabled to false in job-server local.conf and tried again. It did not help. It still took about 2 minutes. Oddly, setting it to true seemed to make it worse. I did find something that did work though. If I simply call cache() on the dataframe after the add column (right after step 1 above) then it runs very quickly. The time spent in cacheTable goes from 60 seconds to 0.5 seconds. I don't understand why though. I thought calling cache would only help if there was branching, but the pipeline is linear isn't it? Here is what the query plan looks like in the call to cache the dataframe before transforming with the pipeline. {code} Project [Plate#6, State#7, License Type#8, Summons Number#9, Issue Date#10, Violation Time#11, Violation#12, Judgment Entry Date#13, Fine Amount#14, Penalty Amount#15, Interest Amount#16, Reduction Amount#17, Payment Amount#18, Amount Due#19, Precinct#20, County#21, Issuing Agency#22, Violation Status#23, cast(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Plate#6, State#7), License Type#8), Violation Time#11), Violation#12), UDF(Judgment Entry Date#13)), UDF(Issue Date#10)), UDF(Summons Number#9)), UDF(Fine Amount#14)), UDF(Penalty Amount#15)), UDF(Interest Amount#16)) as string) AS columnBasedOnManyCols#43] +- Relation[Plate#6,State#7,License Type#8,Summons Number#9,Issue Date#10,Violation Time#11,Violation#12,Judgment Entry Date#13,Fine Amount#14,Penalty Amount#15,Interest Amount#16,Reduction Amount#17,Payment Amount#18,Amount Due#19,Precinct#20,County#21,Issuing Agency#22,Violation Status#23] csv {code} Here is how the query plan now looks in the call to cacheTable after transforming with the pipeline. Looks fairly similar to what it was before, but now its fast. 
{code} SubqueryAlias foo123, `foo123` +- Project [Plate#236, State#237, License Type#238, Summons Number#239, Issue Date#240, Violation Time#241, Violation#242, Judgment Entry Date#243, Fine Amount#244, Penalty Amount#245, Interest Amount#246, Reduction Amount#247, Payment Amount#248, Amount Due#249, Precinct#250, County#251, Issuing Agency#252, Violation Status#253, columnBasedOnManyCols#254, Penalty Amount (predicted)#2476] +- Project [Plate#236, Plate_CLEANED__#275, State#237, State_CLEANED__#276, License Type#238, License Type_CLEANED__#277, Summons Number_CLEANED__#362, Summons Number#239, Issue Date#240, Issue Date_CLEANED__#323, Violation Time#241, Violation Time_CLEANED__#279, Violation#242, Violation_CLEANED__#280, Judgment Entry Date#243, Judgment Entry Date_CLEANED__#324, Fine Amount#244, Fine Amount_CLEANED__#325, Penalty Amount#245, Penalty Amount_CLEANED__#326, Interest Amount_CLEANED__#363, Interest Amount#246, Reduction Amount_CLEANED__#364, Reduction Amount#247, ... 33 more fields] +- Project [Plate#236, Plate_CLEANED__#275, State#237, State_CLEANED__#276, License Type#238, License Type_CLEANED__#277, Summons Number_CLEANED__#362, Summons Number#239, Issue Date#240, Issue Date_CLEANED__#323, Violation Time#241, Violation Time_CLEANED__#279, Violation#242, Violation_CLEANED__#280, Judgment Entry Date#243, Judgment Entry Date_CLEANED__#324, Fine Amount#244, Fine Amount_CLEANED__#325, Penalty Amount#245, Penalty Amount_CLEANED__#326, Interest Amount_CLEANED__#363, Interest Amount#246, Reduction Amount_CLEANED__#364, Reduction Amount#247, ... 
33 more fields] +- SubqueryAlias sql_1ea4c1b5c52e_cd062499a688, `sql_1ea4c1b5c52e_cd062499a688` +- Project [Plate#236, Plate_CLEANED__#275, State#237, State_CLEANED__#276, License Type#238, License Type_CLEANED__#277, Summons Number_CLEANED__#362, Summons Number#239, Issue Date#240, Issue Date_CLEANED__#323, Violation Time#241, Violation Time_CLEANED__#279, Violation#242, Violation_CLEANED__#280, Judgment Entry Date#243, Judgment Entry Date_CLEANED__#324, Fine Amount#244, Fine Amount_CLEANED__#325, Penalty Amount#245, Penalty Amount_CLEANED__#326, Interest Amount_CLEANED__#363, Interest Amount#246, Reduction Amount_CLEANED__#364, Reduction Amount#247, ... 32 more fields] +- Project [Plate#236, Plate_CLEANED__#275, State#237, State_CLEANED__#276, License Type#238, License Type_CLEANED__#277, Summons Number_CLEANED__#362, Summons Number#239, Issue Date#240, Issue Date_CLEANED__#323, Violation Time#241, Violation Time_CLEANED__#279, Violation#242, Violation_CLEANED__#280, Judgment Entry Date#243, Judgment Entry Date_CLEANED__#324, Fine Amount#244, Fine Amount_CLEANED__#325, Penalty Amount#245, Penalty Amount_CLEANED__#326, Interest Amount_CLEANED__#363, Interest Amount#246, Reduction Amount_CLEANED__#364, Reduction Amount#247, ... 31 more fields
[jira] [Resolved] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,W
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artur Sukhenko resolved SPARK-19068. Resolution: Duplicate Fix Version/s: 2.2.0 Closing as duplicate of SPARK-19146. Messages like : {code}ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(136,WrappedArray()){code} are the result of problem discussed in SPARK-19146. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Fix For: 2.2.0 > > Attachments: sparklog.tar.gz > > > On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in > order to use all RAM and cores for a 100TB Spark SQL workload. Long-running > queries tend to report the following ERRORs > {noformat} > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(136,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(853,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(395,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(736,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! 
Dropping event > SparkListenerExecutorMetricsUpdate(439,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(16,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(307,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(51,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(535,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(63,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(333,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(484,WrappedArray()) > (omitted) > {noformat} > The message itself maybe a reasonable response to a already stopped > SparkListenerBus (so subsequent events are thrown away with that ERROR > message). The issue is that because SparkContext does NOT exit until all > these ERROR/events are reported, which is a huge number in our setup -- and > this can take, in some cases, hours!!! > We tried increasing the > Adding default property: spark.scheduler.listenerbus.eventqueue.size=13 > from 10K, this still occurs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetri
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rostyslav Sotnychenko updated SPARK-19068: -- Comment: was deleted (was: https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is also called on each _TaskEnd_ event. My idea is to call _LinkedHashMap.drop_ less often - will submit PR soon.) > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in > order to use all RAM and cores for a 100TB Spark SQL workload. Long-running > queries tend to report the following ERRORs > {noformat} > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(136,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(853,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(395,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(736,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! 
Dropping event > SparkListenerExecutorMetricsUpdate(439,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(16,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(307,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(51,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(535,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(63,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(333,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(484,WrappedArray()) > (omitted) > {noformat} > The message itself maybe a reasonable response to a already stopped > SparkListenerBus (so subsequent events are thrown away with that ERROR > message). The issue is that because SparkContext does NOT exit until all > these ERROR/events are reported, which is a huge number in our setup -- and > this can take, in some cases, hours!!! > We tried increasing the > Adding default property: spark.scheduler.listenerbus.eventqueue.size=13 > from 10K, this still occurs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958956#comment-15958956 ] Rostyslav Sotnychenko commented on SPARK-19068: --- Actually, maybe we should close this one as DUPLICATE? The same thing got fixed in SPARK-19146. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in > order to use all RAM and cores for a 100TB Spark SQL workload. Long-running > queries tend to report the following ERRORs > {noformat} > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(136,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(853,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(395,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(736,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(439,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! 
Dropping event > SparkListenerExecutorMetricsUpdate(16,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(307,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(51,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(535,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(63,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(333,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(484,WrappedArray()) > (omitted) > {noformat} > The message itself maybe a reasonable response to a already stopped > SparkListenerBus (so subsequent events are thrown away with that ERROR > message). The issue is that because SparkContext does NOT exit until all > these ERROR/events are reported, which is a huge number in our setup -- and > this can take, in some cases, hours!!! > We tried increasing the > Adding default property: spark.scheduler.listenerbus.eventqueue.size=13 > from 10K, this still occurs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20227) Job hangs when joining a lot of aggregated columns
[ https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Quentin Auge updated SPARK-20227: - Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge (was: AWS emr-5.4.0 m4.xlarge) > Job hangs when joining a lot of aggregated columns > -- > > Key: SPARK-20227 > URL: https://issues.apache.org/jira/browse/SPARK-20227 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge >Reporter: Quentin Auge > > I'm trying to replace a lot of different columns in a dataframe with > aggregates of themselves, and then join the resulting dataframe. > {code} > # Create a dataframe with 1 row and 50 columns > n = 50 > df = sc.parallelize([Row(*range(n))]).toDF() > cols = df.columns > # Replace each column values with aggregated values > window = Window.partitionBy(cols[0]) > for col in cols[1:]: > df = df.withColumn(col, sum(col).over(window)) > # Join > other_df = sc.parallelize([Row(0)]).toDF() > result = other_df.join(df, on = cols[0]) > result.show() > {code} > Spark hangs forever when executing the last line. The strange thing is, it > depends on the number of columns. Spark does not hang for n = 5, 10, or 20 > columns. For n = 50 and beyond, it does. > {code} > 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because > it has been idle for 60 seconds (new desired total will be 0) > 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling > executor 1. > 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0) > 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor > 1 from BlockManagerMaster. 
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager > BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None) > 17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in > removeExecutor > 17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on > ip-172-30-0-149.ec2.internal killed by driver. > 17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has > been removed (new total is 0) > {code} > All executors are inactive and thus killed after 60 seconds, the master > spends some CPU on a process that hangs indefinitely, and the workers are > idle. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdat
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958880#comment-15958880 ] Rostyslav Sotnychenko edited comment on SPARK-19068 at 4/6/17 1:14 PM: --- https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is also called on each _TaskEnd_ event. My idea is to call _LinkedHashMap.drop_ less often - will submit PR soon. was (Author: sota): https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is also called on each _TaskEnd_ event. My idea is to call it less often - will submit PR soon. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in > order to use all RAM and cores for a 100TB Spark SQL workload. Long-running > queries tend to report the following ERRORs > {noformat} > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(136,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! 
Dropping event > SparkListenerExecutorMetricsUpdate(853,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(395,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(736,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(439,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(16,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(307,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(51,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(535,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(63,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(333,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(484,WrappedArray()) > (omitted) > {noformat} > The message itself maybe a reasonable response to a already stopped > SparkListenerBus (so subsequent events are thrown away with that ERROR > message). 
The issue is that SparkContext does NOT exit until all > these ERROR events are reported, which is a huge number in our setup, and > this can take, in some cases, hours. > We tried increasing the event queue size (Adding default property: > spark.scheduler.listenerbus.eventqueue.size=13) from the default 10K, but this still occurs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
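The "call _LinkedHashMap.drop_ less often" idea discussed in the comment above can be sketched as batched trimming. This is an illustrative Python sketch only (the real structure is a Scala LinkedHashMap inside JobProgressListener; all names below are made up for the example):

```python
from collections import OrderedDict

class RetainedTaskTable:
    """Insertion-ordered task table bounded at `limit`.

    Rather than evicting an entry on every single TaskEnd event (the
    per-event drop the comment describes), the table is allowed to
    overshoot by `batch` entries and the whole surplus is then evicted
    at once, so the trim runs once per `batch` events instead of once
    per event.
    """

    def __init__(self, limit, batch):
        self.limit = limit
        self.batch = batch
        self.tasks = OrderedDict()

    def on_task_end(self, task_id, info):
        self.tasks[task_id] = info
        # Trim only once the overshoot threshold is crossed.
        if len(self.tasks) > self.limit + self.batch:
            while len(self.tasks) > self.limit:
                self.tasks.popitem(last=False)  # drop the oldest entry
```

The trade-off is slightly higher peak memory (up to `limit + batch` retained tasks) in exchange for far fewer O(n) trim passes under heavy TaskEnd traffic.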
[jira] [Commented] (SPARK-19802) Remote History Server
[ https://issues.apache.org/jira/browse/SPARK-19802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958896#comment-15958896 ] Ben Barnard commented on SPARK-19802: - The scenario we're trying to support is running Spark applications on a cluster without HDFS, with a scheduler such as Nomad or Kubernetes. Spark applications are run by the scheduler without the need for a Spark master, but we'd like a straightforward mechanism for the applications to be able to opt in to publishing to a history server. Yes, we were thinking of implementing an alternative `ApplicationHistoryProvider` to receive events in the history server, and something like `EventLoggingListener` in the driver that would send events to the server. We recognise that some refactoring may be required, but we are prepared to contribute this. > Remote History Server > - > > Key: SPARK-19802 > URL: https://issues.apache.org/jira/browse/SPARK-19802 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Ben Barnard > > Currently the history server expects to find history in a filesystem > somewhere. It would be nice to have a history server that listens for > application events on a TCP port, and an EventLoggingListener that sends > events to the listening history server instead of writing to a file. This > would allow the history server to show up-to-date history for past and > running jobs in a cluster environment that lacks a shared filesystem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
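The proposal above can be illustrated with a toy sketch: a "history server" that accepts newline-delimited JSON events over TCP, and a driver-side sender standing in for an EventLoggingListener that writes to a socket instead of a file. Everything here is hypothetical scaffolding to show the shape of the idea, not actual Spark API:

```python
import json
import socket
import socketserver
import threading

received = []  # events the toy history server has accepted

class EventHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One JSON-encoded event per line, as an event log file would hold.
        for line in self.rfile:
            received.append(json.loads(line))

def start_history_server(port=0):
    """Bind a TCP listener (ephemeral port by default) and serve in a
    background thread, standing in for a remote history server."""
    server = socketserver.TCPServer(("127.0.0.1", port), EventHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

class RemoteEventLogger:
    """Driver-side stand-in for EventLoggingListener: ships each event to
    the listening server instead of appending to a shared filesystem."""

    def __init__(self, address):
        self.sock = socket.create_connection(address)

    def on_event(self, event):
        self.sock.sendall((json.dumps(event) + "\n").encode())

    def close(self):
        self.sock.close()
```

Usage would look like `logger = RemoteEventLogger(server.server_address)` followed by `logger.on_event({...})`; a real implementation would of course need batching, reconnection, and authentication, which is where the refactoring mentioned in the comment comes in.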
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958894#comment-15958894 ] Eyal Farago commented on SPARK-19870: - [~stevenruppert] can you attach traces? [~joshrosen], a glance at the torrent broadcast code shows that it explicitly [releases the read-lock|https://github.com/apache/spark/blob/d009fb369bbea0df81bbcf9c8028d14cfcaa683b/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L214] on the block even though the [iterator obtained from the BlockResult|https://github.com/apache/spark/blob/d009fb369bbea0df81bbcf9c8028d14cfcaa683b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L497] is a completion iterator that releases locks (read lock in this scenario) upon completion. is it possible that the torrent code 'double-releases' the read lock on the relevant block putting it in some inconsistent state that prevents the next read from obtaining the lock (or somehow not notify on release)? [~joshrosen], seems at least part of this is related to your work on SPARK-12757, hence tagging you. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > > > Key: SPARK-19870 > URL: https://issues.apache.org/jira/browse/SPARK-19870 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 2.0.2, 2.1.0 > Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, > yarn coarse-grained. >Reporter: Steven Ruppert > Attachments: stack.txt > > > Running what I believe to be a fairly vanilla spark job, using the RDD api, > with several shuffles, a cached RDD, and finally a conversion to DataFrame to > save to parquet. I get a repeatable deadlock at the very last reducers of one > of the stages. 
> Roughly: > {noformat} > "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 > tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry > [0x7fffb95f3000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207) > - waiting to lock <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x0005b12f2290> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > and > {noformat} > "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 > tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at > org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202) > - locked <0x000545736b58> (a > org.apache.spark.storage.BlockInfoManager) > at > 
org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210) > - locked <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x00059711eb10> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apa
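The "double release" hypothesis in the comment above can be illustrated with a toy reader-count lock. This is purely illustrative Python (the real logic is Scala in Spark's BlockInfoManager): if any code path releases a read lock it no longer holds, the reader count goes negative, and anything waiting for the count to reach zero can wait forever.

```python
import threading

class ToyBlockReadLock:
    """Minimal reader-count lock for one block.

    `release_read_lock` does not check that a lock is actually held, so a
    double release drives the count negative, breaking the invariant that
    readers == 0 means the block is free."""

    def __init__(self):
        self.cond = threading.Condition()
        self.readers = 0

    def lock_for_reading(self):
        with self.cond:
            self.readers += 1

    def release_read_lock(self):
        with self.cond:
            self.readers -= 1  # no guard against releasing an unheld lock
            self.cond.notify_all()

    def writer_can_proceed(self):
        with self.cond:
            return self.readers == 0

lock = ToyBlockReadLock()
lock.lock_for_reading()
lock.release_read_lock()  # legitimate release (e.g. by a completion iterator)
lock.release_read_lock()  # second, explicit release: count is now -1
```

After the second release the count is -1, so `writer_can_proceed()` is permanently false: a caricature of the inconsistent state the comment suspects the torrent code's explicit unlock plus the completion iterator's unlock could produce.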
[jira] [Commented] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958880#comment-15958880 ] Rostyslav Sotnychenko commented on SPARK-19068: --- https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is called also on each _TaskEnd_ event. My idea is to call it less often - will submit PR soon. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > (issue description and error log omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdat
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958880#comment-15958880 ] Rostyslav Sotnychenko edited comment on SPARK-19068 at 4/6/17 1:00 PM: --- https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is also called on each _TaskEnd_ event. My idea is to call it less often - will submit PR soon. was (Author: sota): https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is called also on each _TaskEnd_ event. My idea is to call it less often - will submit PR soon. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > (issue description and error log omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20242) Delay stop of web UI
[ https://issues.apache.org/jira/browse/SPARK-20242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20242: Assignee: Apache Spark > Delay stop of web UI > > > Key: SPARK-20242 > URL: https://issues.apache.org/jira/browse/SPARK-20242 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Ben Barnard >Assignee: Apache Spark > > When debugging Spark applications or deployments, it is often useful to > inspect the Web UI of the driver application. Especially when the application > is running remotely and facilities such as the history server aren't > available, it can be limiting and frustrating that the UI stops when the > application finishes. Developers sometimes add sleep statements to the end of > their programs to allow this, but this requires rebuilding an application > to run it again with a delay. > It is useful to have a configuration property, such as spark.ui.stopDelay, > that can be used with arbitrary applications to prevent the UI from stopping > immediately. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20242) Delay stop of web UI
[ https://issues.apache.org/jira/browse/SPARK-20242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958872#comment-15958872 ] Apache Spark commented on SPARK-20242: -- User 'barnardb' has created a pull request for this issue: https://github.com/apache/spark/pull/17551 > Delay stop of web UI > > > Key: SPARK-20242 > URL: https://issues.apache.org/jira/browse/SPARK-20242 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Ben Barnard > > (issue description omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20242) Delay stop of web UI
[ https://issues.apache.org/jira/browse/SPARK-20242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20242: Assignee: (was: Apache Spark) > Delay stop of web UI > > > Key: SPARK-20242 > URL: https://issues.apache.org/jira/browse/SPARK-20242 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Ben Barnard > > (issue description omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958846#comment-15958846 ] Rostyslav Sotnychenko commented on SPARK-19068: --- As a workaround, one can raise the value of "spark.ui.retainedTasks" to the number of tasks the job will have, or higher. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > (issue description and error log omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
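The workaround above is set like any other Spark configuration property. The task count below is an arbitrary illustration; in practice it should come from the job's actual number of tasks:

```properties
# spark-defaults.conf, or passed as --conf to spark-submit
spark.ui.retainedTasks   200000
```

Note this trades driver memory (more retained task rows in the UI state) for avoiding the per-event trimming behaviour discussed in the earlier comments.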
[jira] [Created] (SPARK-20242) Delay stop of web UI
Ben Barnard created SPARK-20242: --- Summary: Delay stop of web UI Key: SPARK-20242 URL: https://issues.apache.org/jira/browse/SPARK-20242 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 2.1.0 Reporter: Ben Barnard When debugging Spark applications or deployments, it is often useful to inspect the Web UI of the driver application. Especially when the application is running remotely and facilities such as the history server aren't available, it can be limiting and frustrating that the UI stops when the application finishes. Developers sometimes add sleep statements to the end of their programs to allow this, but this requires rebuilding an application to run it again with a delay. It is useful to have a configuration property, such as spark.ui.stopDelay, that can be used with arbitrary applications to prevent the UI from stopping immediately. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
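The property named in the ticket is only a proposal at this point; the snippet below illustrates how it might be used if adopted, and is not an existing Spark setting:

```properties
# Proposed in SPARK-20242 (not an existing Spark property): keep the
# driver's web UI alive for a while after the application finishes,
# without adding sleep statements to the application code.
spark.ui.stopDelay   5m
```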
[jira] [Commented] (SPARK-20242) Delay stop of web UI
[ https://issues.apache.org/jira/browse/SPARK-20242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958804#comment-15958804 ] Ben Barnard commented on SPARK-20242: - I will submit a pull request for this. > Delay stop of web UI > > > Key: SPARK-20242 > URL: https://issues.apache.org/jira/browse/SPARK-20242 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Ben Barnard > > (issue description omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958801#comment-15958801 ] Apache Spark commented on SPARK-20240: -- User 'zenglinxi0615' has created a pull request for this issue: https://github.com/apache/spark/pull/17550 > SparkSQL support limitations of max dynamic partitions when inserting hive > table > > > Key: SPARK-20240 > URL: https://issues.apache.org/jira/browse/SPARK-20240 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 1.6.3, 2.1.0 >Reporter: zenglinxi > > We found that an HDFS problem sometimes occurs when a user has a typo in their > code while using SparkSQL to insert data into a partitioned table. > For example: > create table: > {quote} > create table test_tb ( >price double > ) PARTITIONED BY (day_partition string, hour_partition string) > {quote} > normal sql for inserting into the table: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select price, day_partition, hour_partition from other_table; > {quote} > sql with a typo: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select hour_partition, day_partition, price from other_table; > {quote} > This typo makes SparkSQL treat column "price" as "hour_partition", which may > create millions of HDFS files in a short time if "other_table" has large data > with a wide range of "price" values, and badly degrade NameNode > RPC performance. > We think it's a good idea to limit the maximum number of files allowed to be > created by each task, to protect the HDFS NameNode from inadvertent errors. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
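For comparison, Hive itself already enforces guards of this kind on dynamic-partition inserts; the proposal above would give SparkSQL analogous protection. A sketch of Hive's settings (names are real Hive properties; the values shown are illustrative, roughly Hive's documented defaults, and should be checked against the Hive version in use):

```properties
# Maximum dynamic partitions a single statement may create, cluster-wide
hive.exec.max.dynamic.partitions=1000
# Maximum dynamic partitions per mapper/reducer node
hive.exec.max.dynamic.partitions.pernode=100
# Maximum HDFS files a single MapReduce job may create
hive.exec.max.created.files=100000
```

With limits like these, a swapped-column typo fails the statement quickly instead of flooding the NameNode with file creations.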
[jira] [Assigned] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20240:
------------------------------------

    Assignee: Apache Spark

> SparkSQL support limitations of max dynamic partitions when inserting hive
> table
> --------------------------------------------------------------------------
>
>                 Key: SPARK-20240
>                 URL: https://issues.apache.org/jira/browse/SPARK-20240
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.2, 1.6.3, 2.1.0
>            Reporter: zenglinxi
>            Assignee: Apache Spark
>
> We found that an HDFS problem sometimes occurs when a user has a typo in
> their code while using SparkSQL to insert data into a partitioned table.
> For example:
> create table:
> {quote}
> create table test_tb (
>   price double
> ) PARTITIONED BY (day_partition string, hour_partition string)
> {quote}
> normal SQL for inserting into the table:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select price, day_partition, hour_partition from other_table;
> {quote}
> SQL with a typo:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select hour_partition, day_partition, price from other_table;
> {quote}
> This typo makes SparkSQL treat the column "price" as "hour_partition",
> which may create millions of HDFS files in a short time if "other_table"
> holds a large amount of data with a wide range of "price" values, badly
> degrading NameNode RPC performance.
> We think it is a good idea to limit the maximum number of files each task
> is allowed to create, to protect the HDFS NameNode from inadvertent errors.
[jira] [Assigned] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20240:
------------------------------------

    Assignee: (was: Apache Spark)

> SparkSQL support limitations of max dynamic partitions when inserting hive
> table
> --------------------------------------------------------------------------
>
>                 Key: SPARK-20240
>                 URL: https://issues.apache.org/jira/browse/SPARK-20240
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.2, 1.6.3, 2.1.0
>            Reporter: zenglinxi
>
> We found that an HDFS problem sometimes occurs when a user has a typo in
> their code while using SparkSQL to insert data into a partitioned table.
> For example:
> create table:
> {quote}
> create table test_tb (
>   price double
> ) PARTITIONED BY (day_partition string, hour_partition string)
> {quote}
> normal SQL for inserting into the table:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select price, day_partition, hour_partition from other_table;
> {quote}
> SQL with a typo:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select hour_partition, day_partition, price from other_table;
> {quote}
> This typo makes SparkSQL treat the column "price" as "hour_partition",
> which may create millions of HDFS files in a short time if "other_table"
> holds a large amount of data with a wide range of "price" values, badly
> degrading NameNode RPC performance.
> We think it is a good idea to limit the maximum number of files each task
> is allowed to create, to protect the HDFS NameNode from inadvertent errors.
[jira] [Commented] (SPARK-20228) Random Forest instable results depending on spark.executor.memory
[ https://issues.apache.org/jira/browse/SPARK-20228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958755#comment-15958755 ] Sean Owen commented on SPARK-20228:
-----------------------------------

I don't think there's a problem here. I'm saying that if you vary these
parameters you might legitimately get better or worse results. I'm also
asking whether you are varying these other things. Also, is this consistent,
or just on one run?

> Random Forest instable results depending on spark.executor.memory
> -----------------------------------------------------------------
>
>                 Key: SPARK-20228
>                 URL: https://issues.apache.org/jira/browse/SPARK-20228
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Ansgar Schulze
>
> If I deploy a random forest model with, for example,
> spark.executor.memory=20480M, I get a different result than if I deploy the
> model with spark.executor.memory=6000M.
> I expected the same results but different runtimes.
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958754#comment-15958754 ] Mathieu D commented on SPARK-20082:
-----------------------------------

[~yuhaoyan] or [~josephkb], any feedback on this approach and PR?

> Incremental update of LDA model, by adding initialModel as start point
> ----------------------------------------------------------------------
>
>                 Key: SPARK-20082
>                 URL: https://issues.apache.org/jira/browse/SPARK-20082
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Mathieu D
>
> Some MLlib models support an initialModel to start from and update
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to
> incrementally update an existing model with batches of new documents.
> I suggest adding an initialModel as a starting point for LDA.
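The warm-start pattern being requested here is generic. A minimal sketch of it (the names and the toy counting "model" below are hypothetical, not MLlib's LDA API): fitting optionally continues from a previous model's state instead of starting from scratch.

```python
def fit_incremental(batch, initial_model=None):
    """Sketch of warm-start training: seed the new fit with the state of a
    previous model, then fold in the new batch of data."""
    counts = dict(initial_model) if initial_model else {}
    for token in batch:
        counts[token] = counts.get(token, 0) + 1
    return counts

m1 = fit_incremental(["a", "b", "a"])                  # train from scratch
m2 = fit_incremental(["b", "c"], initial_model=m1)     # continue from m1
```

For LDA the state would be the topic-word statistics maintained by the optimizer rather than raw counts, but the shape of the API (an optional initialModel parameter) is the same.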
[jira] [Commented] (SPARK-20228) Random Forest instable results depending on spark.executor.memory
[ https://issues.apache.org/jira/browse/SPARK-20228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958719#comment-15958719 ] Ansgar Schulze commented on SPARK-20228:
----------------------------------------

Shouldn't there be a WARN message if there is a memory issue? There is one
if I change the maxMemoryInMB parameter to a low value, but not in this
situation.

> Random Forest instable results depending on spark.executor.memory
> -----------------------------------------------------------------
>
>                 Key: SPARK-20228
>                 URL: https://issues.apache.org/jira/browse/SPARK-20228
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Ansgar Schulze
>
> If I deploy a random forest model with, for example,
> spark.executor.memory=20480M, I get a different result than if I deploy the
> model with spark.executor.memory=6000M.
> I expected the same results but different runtimes.
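One plausible source of such run-to-run differences (an assumption offered here, not a confirmed diagnosis of this ticket) is that memory settings change partitioning and therefore the order in which floating-point values are combined. Floating-point addition is not associative, so different groupings of the same values can give slightly different sums:

```python
import math

vals = [0.1] * 10
left_to_right = sum(vals)   # plain left-to-right summation
exact = math.fsum(vals)     # correctly rounded sum of the same values

# The two summation orders disagree in the last bits; a difference that
# small can still flip tie-breaks during tree training.
print(left_to_right, exact)
```

Since no out-of-memory condition occurs, such numerical differences would also explain why no WARN message is emitted.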
[jira] [Updated] (SPARK-20241) java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribute
[ https://issues.apache.org/jira/browse/SPARK-20241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 翟玉勇 updated SPARK-20241:
------------------------

    Description:

Running a SQL operation, for example "insert into table mytable select * from othertable", produces the following WARN log:
{code}
17/04/06 17:04:55 WARN [org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator(87) -- main]: Error calculating stats of compiled class.
java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribute
	at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:164)
	at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:168)
	at sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:55)
	at sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
	at java.lang.reflect.Field.get(Field.java:379)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:997)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:984)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:984)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:983)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:983)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:979)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:979)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:951)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1023)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1020)
	at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
	at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
	at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
	at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
	at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
	at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:905)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:357)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:310)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(I