[jira] [Commented] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
[ https://issues.apache.org/jira/browse/SPARK-20248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960371#comment-15960371 ] Apache Spark commented on SPARK-20248: -- User 'shaolinliu' has created a pull request for this issue: https://github.com/apache/spark/pull/17561

> Spark SQL add limit parameter to enhance the reliability.
> Key: SPARK-20248
> URL: https://issues.apache.org/jira/browse/SPARK-20248
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Environment: 2.1.0
> Reporter: shaolinliu
> Priority: Minor
>
> When we use the Thrift server, it is difficult to constrain users' SQL statements.
> When a user queries a large table without a LIMIT clause, the Thrift server process must hold the full result set in memory, which can destabilize the service.
> In general, such a query is a misuse, because if you really need the whole table:
> 1. if you use this data for computation, you can complete the computation in the cluster and return only the result;
> 2. if you want to obtain the raw data, you can store it in HDFS instead.
> For the scenario above, it is recommended to add a "spark.sql.thriftServer.retainedResults" parameter:
> 1. when it is 0, we do not restrict the user's operation;
> 2. when it is greater than 0, if the user queries with a LIMIT we use the user's limit; if not, we use this value to cap the query's result.
> The user's own limit takes priority because a user who writes a LIMIT is generally aware of the exact meaning of the query.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
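The rule the issue proposes can be sketched as a small helper. This is a hypothetical illustration of the semantics described above, not code from the pull request; the function and parameter names (`effectiveLimit`, `retainedResults`, `userLimit`) are made up:

```scala
// Sketch of the proposed spark.sql.thriftServer.retainedResults semantics:
// 0 disables the server-side cap; otherwise the user's own LIMIT, if any,
// takes priority over the configured cap.
def effectiveLimit(retainedResults: Long, userLimit: Option[Long]): Option[Long] =
  if (retainedResults == 0) userLimit            // no restriction from the server
  else userLimit.orElse(Some(retainedResults))   // fall back to the configured cap
```

For example, with `retainedResults = 1000`, a query carrying `LIMIT 50` keeps its own limit, while an unbounded query is capped at 1000 rows.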
[jira] [Commented] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
[ https://issues.apache.org/jira/browse/SPARK-20248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960335#comment-15960335 ] Apache Spark commented on SPARK-20248: -- User 'shaolinliu' has created a pull request for this issue: https://github.com/apache/spark/pull/17560
[jira] [Assigned] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
[ https://issues.apache.org/jira/browse/SPARK-20248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20248: Assignee: Apache Spark
[jira] [Assigned] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
[ https://issues.apache.org/jira/browse/SPARK-20248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20248: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20246: Assignee: Apache Spark

> Should check determinism when pushing predicates down through aggregation
> Key: SPARK-20246
> URL: https://issues.apache.org/jira/browse/SPARK-20246
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Weiluo Ren
> Assignee: Apache Spark
> Labels: correctness
>
> {code}
> import org.apache.spark.sql.functions._
> spark.range(1, 1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show
> {code}
> gives a wrong result.
> The optimized logical plan shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen.
> cc [~lian cheng]
[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960329#comment-15960329 ] Apache Spark commented on SPARK-20246: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/17559
[jira] [Assigned] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20246: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960325#comment-15960325 ] Liang-Chi Hsieh commented on SPARK-20246: - For union and window, we don't perform the expression replacement, so I think they should be safe.
[jira] [Created] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
shaolinliu created SPARK-20248: -- Summary: Spark SQL add limit parameter to enhance the reliability. Key: SPARK-20248 URL: https://issues.apache.org/jira/browse/SPARK-20248 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Environment: 2.1.0 Reporter: shaolinliu Priority: Minor
[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960300#comment-15960300 ] Liang-Chi Hsieh commented on SPARK-20246: - We should also check the determinism of the [replaced expression|https://github.com/apache/spark/blob/a4491626ed8169f0162a0dfb78736c9b9e7fb434/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L810].
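The fix discussed in these comments amounts to checking determinism on the predicate *after* the aggregate's aliases have been substituted in, not before. A toy sketch of that guard, using simplified stand-ins rather than the real Catalyst `Expression` API:

```scala
// Simplified stand-ins for Catalyst expressions (not the real API):
sealed trait Expr { def deterministic: Boolean }
case class Attr(name: String) extends Expr { def deterministic = true }
case object Rand extends Expr { def deterministic = false }
case class Gt(child: Expr, bound: Double) extends Expr {
  def deterministic: Boolean = child.deterministic
}

// Substitute aggregate-side aliases into the predicate, then check
// determinism of the *replaced* expression before pushing it down.
def canPushDown(pred: Expr, aliases: Map[String, Expr]): Boolean = {
  def substitute(e: Expr): Expr = e match {
    case Attr(n)  => aliases.getOrElse(n, e)
    case Gt(c, b) => Gt(substitute(c), b)
    case other    => other
  }
  substitute(pred).deterministic
}

// col("random") > 0.3, where "random" aliases rand(): after substitution
// the predicate is non-deterministic, so it must NOT be pushed down.
// canPushDown(Gt(Attr("random"), 0.3), Map("random" -> Rand)) == false
```

The point is that `Attr("random")` looks deterministic on its own; only after replacing the alias with `rand()` does the check correctly reject the pushdown.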
[jira] [Commented] (SPARK-20241) java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribut
[ https://issues.apache.org/jira/browse/SPARK-20241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960291#comment-15960291 ] 翟玉勇 commented on SPARK-20241: - I am sorry; this WARN is caused by code I added to Spark myself. I changed the thread classloader in HiveSessionState, but an exception was thrown before the classloader was reset back.

> java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribute
> Key: SPARK-20241
> URL: https://issues.apache.org/jira/browse/SPARK-20241
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: 翟玉勇
> Priority: Minor
>
> Run a SQL operation, for example: insert into table mytable select * from othertable. The following WARN is logged:
> {code}
> 17/04/06 17:04:55 WARN [org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator(87) -- main]: Error calculating stats of compiled class.
> java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribute
> at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:164)
> at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:168)
> at sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:55)
> at sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
> at java.lang.reflect.Field.get(Field.java:379)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:997)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:984)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:984)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:983)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:983)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:979)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:979)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:951)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1023)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1020)
> at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerato
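The failure mode the reporter describes (an exception thrown before the classloader was reset back) is the classic reason to restore a thread's context classloader in a `finally` block. A minimal sketch; the `withClassLoader` helper name is hypothetical, not a Spark API:

```scala
object ClassLoaderUtil {
  /** Runs `body` with `loader` as the thread context classloader,
    * restoring the original loader even if `body` throws. */
  def withClassLoader[T](loader: ClassLoader)(body: => T): T = {
    val thread = Thread.currentThread()
    val original = thread.getContextClassLoader
    thread.setContextClassLoader(loader)
    try body
    finally thread.setContextClassLoader(original) // always reset back
  }
}
```

With this pattern, the "Can not set final [B field" WARN described above would not appear, because the swapped-in classloader can never leak past the wrapped block.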
[jira] [Closed] (SPARK-20241) java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribute
[ https://issues.apache.org/jira/browse/SPARK-20241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 翟玉勇 closed SPARK-20241. --- Resolution: Not A Bug
[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960287#comment-15960287 ] Liang-Chi Hsieh commented on SPARK-20246: - It seems we already check determinism [here|https://github.com/apache/spark/blob/a4491626ed8169f0162a0dfb78736c9b9e7fb434/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L805].
[jira] [Comment Edited] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960001#comment-15960001 ] Liang-Chi Hsieh edited comment on SPARK-20226 at 4/7/17 5:19 AM: - {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. It must be set through SQLConf. I am afraid that the configs in local.conf only cover Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? was (Author: viirya): {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. It must be set through SQLConf. I am afraid that the configs in local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf?

> Call to sqlContext.cacheTable takes an incredibly long time in some cases
> Key: SPARK-20226
> URL: https://issues.apache.org/jira/browse/SPARK-20226
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Environment: linux or windows
> Reporter: Barry Becker
> Labels: cache
> Attachments: profile_indexer2.PNG, xyzzy.csv
>
> I have a case where the call to sqlContext.cacheTable can take an arbitrarily long time depending on the number of columns referenced in a withColumn expression applied to a dataframe.
> The dataset is small (20 columns, 7861 rows). The sequence to reproduce is the following:
> 1) Add a new column that references 8 - 14 of the columns in the dataset.
>    - If I add 8 columns, the call to cacheTable is fast - like *5 seconds*.
>    - If I add 11 columns, it is slow - like *60 seconds*.
>    - If I add 14 columns, it basically *takes forever* - I gave up after 10 minutes or so.
> The added Column expression basically just concatenates the columns into a single string. If a number is concatenated onto a string (or vice versa), the number is first converted to a string.
> The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. > Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > 
{code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons > Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputC
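Liang-Chi's suggestion in the comment above, setting the flag through SQLConf rather than through a SparkConf properties file, can be sketched as follows. This is a minimal illustration assuming a Spark 2.x session; the flag name is taken from the comment, and whether it is honored depends on the Spark version in use:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("constraint-propagation-demo")
  .master("local[*]")
  .getOrCreate()

// SQL config flags go through the runtime SQL conf (SQLConf),
// not only through SparkConf / spark-defaults files:
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
```

The same effect is available from SQL as `SET spark.sql.constraintPropagation.enabled=false`.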
[jira] [Comment Edited] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960001#comment-15960001 ] Liang-Chi Hsieh edited comment on SPARK-20226 at 4/7/17 5:19 AM: - {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. It must be set through SQLConf. I am afraid that the configs in local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? was (Author: viirya): {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. It must be set through SQLConf. I am not sure if your local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? > Call to sqlContext.cacheTable takes an incredibly long time in some cases > - > > Key: SPARK-20226 > URL: https://issues.apache.org/jira/browse/SPARK-20226 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: linux or windows >Reporter: Barry Becker > Labels: cache > Attachments: profile_indexer2.PNG, xyzzy.csv > > > I have a case where the call to sqlContext.cacheTable can take an arbitrarily > long time depending on the number of columns that are referenced in a > withColumn expression applied to a dataframe. > The dataset is small (20 columns 7861 rows). The sequence to reproduce is the > following: > 1) add a new column that references 8 - 14 of the columns in the dataset. >- If I add 8 columns, then the call to cacheTable is fast - like *5 > seconds* >- If I add 11 columns, then it is slow - like *60 seconds* >- and if I add 14 columns, then it basically *takes forever* - I gave up > after 10 minutes or so. > The Column expression that is added, is basically just concatenating > the columns together in a single string. If a number is concatenated on a > string (or vice versa) the number is first converted to a string. 
> The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. > Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > 
{code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons > Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputCol":"Summ
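The workaround suggested in the comment above (setting the flag through SQLConf rather than a SparkConf properties file) can be sketched in application code. This is a hypothetical illustration, not the issue's resolution: the flag name is taken from the comment, and the app and table names are made up.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: set the SQL config flag through the runtime session conf (SQLConf),
// as the comment recommends, instead of a SparkConf file entry.
val spark = SparkSession.builder()
  .appName("spark-20226-workaround")  // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Toggle the flag before the expensive cacheTable call.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")

// "myTable" is a placeholder for the registered table from the report.
spark.sqlContext.cacheTable("myTable")
```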
[jira] [Issue Comment Deleted] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
[ https://issues.apache.org/jira/browse/SPARK-20247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-20247: Comment: was deleted (was: I will create a PR later.) > Add jar but this jar is missing later shouldn't affect jobs that doesn't use > this jar > - > > Key: SPARK-20247 > URL: https://issues.apache.org/jira/browse/SPARK-20247 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > The stack trace: > {noformat} > 17/04/06 17:24:46 ERROR Executor: Exception in task 25.1 in stage 3.0 (TID > 199) > java.lang.RuntimeException: Stream '/jars/test-1.0.0.jar' was not found. > at > org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:222) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) > at > 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > 17/04/06 17:24:46 INFO CoarseGrainedExecutorBackend: Got assigned task 200 > {noformat} > How to reproduce: > {code:java} > 
test("add jar but this jar is missing later") { > val tmpDir = Utils.createTempDir() > val tmpJar = File.createTempFile("test-1.0.0", ".jar", tmpDir) > sc = new SparkContext(new > SparkConf().setAppName("test").setMaster("local")) > sc.addJar(tmpJar.getAbsolutePath) > FileUtils.deleteQuietly(tmpJar) > assert (sc.parallelize(Array(1, 2, 3)).count === 3) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For ad
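The test above deletes the jar after registering it, so executors later fail to fetch a stream they never needed. Until the linked fix lands, a driver-side guard can at least fail fast when the jar is already gone at registration time. Below is a minimal sketch with a hypothetical helper name; note it does not close the race where the file disappears after addJar, which is the scenario the issue itself describes:

```scala
import java.io.File
import org.apache.spark.SparkContext

// Hypothetical helper: only call addJar when the file is still on disk,
// so a stale local path is caught on the driver instead of surfacing later
// as "Stream '/jars/...' was not found" on every executor.
def addJarIfPresent(sc: SparkContext, path: String): Boolean = {
  val jar = new File(path)
  val present = jar.isFile
  if (present) sc.addJar(path)
  present  // caller decides whether a missing jar is fatal
}
```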
[jira] [Assigned] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
[ https://issues.apache.org/jira/browse/SPARK-20247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20247: Assignee: Apache Spark
[jira] [Assigned] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
[ https://issues.apache.org/jira/browse/SPARK-20247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20247: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
[ https://issues.apache.org/jira/browse/SPARK-20247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960279#comment-15960279 ] Apache Spark commented on SPARK-20247: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/17558
[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20208: Assignee: (was: Apache Spark) > Document R fpGrowth support in vignettes, programming guide and code example > > > Key: SPARK-20208 > URL: https://issues.apache.org/jira/browse/SPARK-20208 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung
[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960278#comment-15960278 ] Apache Spark commented on SPARK-20208: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17557
[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20208: Assignee: Apache Spark
[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960277#comment-15960277 ] Maciej Szymkiewicz commented on SPARK-20208: [~felixcheung] I am working on this but it is still in WIP. I'll open a PR and ping when it is ready.
[jira] [Comment Edited] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960277#comment-15960277 ] Maciej Szymkiewicz edited comment on SPARK-20208 at 4/7/17 4:52 AM: [~felixcheung] I am working on this but it is still WIP. I'll open a PR and ping when it is ready. was (Author: zero323): [~felixcheung] I am working on this but it is still in WIP. I'll open a PR and ping when it is ready.
[jira] [Commented] (SPARK-20241) java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribut
[ https://issues.apache.org/jira/browse/SPARK-20241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960275#comment-15960275 ] Kazuaki Ishizaki commented on SPARK-20241: -- I realized that it is not easy to reproduce this error. Would it be possible to post a program to reproduce this issue? > java.lang.IllegalArgumentException: Can not set final [B field > org.codehaus.janino.util.ClassFile$CodeAttribute.code to > org.codehaus.janino.util.ClassFile$CodeAttribute > > > Key: SPARK-20241 > URL: https://issues.apache.org/jira/browse/SPARK-20241 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: 翟玉勇 >Priority: Minor > > Running a SQL operation, for example: insert into table mytable select * from > othertable, produces WARN logs like the following: > {code} > 17/04/06 17:04:55 WARN > [org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator(87) -- > main]: Error calculating stats of compiled class. > java.lang.IllegalArgumentException: Can not set final [B field > org.codehaus.janino.util.ClassFile$CodeAttribute.code to > org.codehaus.janino.util.ClassFile$CodeAttribute > at > sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:164) > at > sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:168) > at > sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:55) > at > sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) > at java.lang.reflect.Field.get(Field.java:379) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:997) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:984) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:984) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:983) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:983) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:979) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:979) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:951) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1023) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1020) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > 
org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) > at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) > at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:905) >
[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960267#comment-15960267 ] Felix Cheung commented on SPARK-20208: -- ping [~zero323]
[jira] [Updated] (SPARK-19825) spark.ml R API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19825: - Target Version/s: 2.2.0 > spark.ml R API for FPGrowth > --- > > Key: SPARK-19825 > URL: https://issues.apache.org/jira/browse/SPARK-19825 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0
[jira] [Resolved] (SPARK-19825) spark.ml R API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-19825. -- Resolution: Fixed Assignee: Maciej Szymkiewicz
[jira] [Updated] (SPARK-19825) spark.ml R API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19825: - Fix Version/s: 2.2.0
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960266#comment-15960266 ] Felix Cheung commented on SPARK-19827: -- yes > spark.ml R API for PIC > -- > > Key: SPARK-19827 > URL: https://issues.apache.org/jira/browse/SPARK-19827 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
[ https://issues.apache.org/jira/browse/SPARK-20247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960265#comment-15960265 ] Yuming Wang commented on SPARK-20247: - I will create a PR later. > Add jar but this jar is missing later shouldn't affect jobs that doesn't use > this jar > - > > Key: SPARK-20247 > URL: https://issues.apache.org/jira/browse/SPARK-20247 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > The stack trace: > {noformat} > 17/04/06 17:24:46 ERROR Executor: Exception in task 25.1 in stage 3.0 (TID > 199) > java.lang.RuntimeException: Stream '/jars/test-1.0.0.jar' was not found. > at > org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:222) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) > 
at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > 17/04/06 17:24:46 INFO CoarseGrainedExecutorBackend: Got assigned task 200 > {noformat} > How to reproduce: >
{code:java} > test("add jar but this jar is missing later") { > val tmpDir = Utils.createTempDir() > val tmpJar = File.createTempFile("test-1.0.0", ".jar", tmpDir) > sc = new SparkContext(new > SparkConf().setAppName("test").setMaster("local")) > sc.addJar(tmpJar.getAbsolutePath) > FileUtils.deleteQuietly(tmpJar) > assert (sc.parallelize(Array(1, 2, 3)).count === 3) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr..
[jira] [Created] (SPARK-20247) Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar
Yuming Wang created SPARK-20247: --- Summary: Add jar but this jar is missing later shouldn't affect jobs that doesn't use this jar Key: SPARK-20247 URL: https://issues.apache.org/jira/browse/SPARK-20247 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0 Reporter: Yuming Wang
[jira] [Assigned] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16957: Assignee: Apache Spark > Use weighted midpoints for split values. > > > Key: SPARK-16957 > URL: https://issues.apache.org/jira/browse/SPARK-16957 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg >Assignee: Apache Spark >Priority: Trivial > > Just like R's gbm, we should be using weighted split points rather than the > actual continuous binned feature values. For instance, in a dataset > containing binary features (that are fed in as continuous ones), our splits > are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some > smoothness qualities, this is asymptotically bad compared to GBM's approach. > The split point should be a weighted split point of the two values of the > "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, > the above split should be at {{0.75}}. > Example: > {code} > +--------+--------+-----+-----+ > |feature0|feature1|label|count| > +--------+--------+-----+-----+ > | 0.0| 0.0| 0.0| 23| > | 1.0| 0.0| 0.0| 2| > | 0.0| 0.0| 1.0| 2| > | 0.0| 1.0| 0.0| 7| > | 1.0| 0.0| 1.0| 23| > | 0.0| 1.0| 1.0| 18| > | 1.0| 1.0| 1.0| 7| > | 1.0| 1.0| 0.0| 18| > +--------+--------+-----+-----+ > DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes > If (feature 0 <= 0.0) >If (feature 1 <= 0.0) > Predict: -0.56 >Else (feature 1 > 0.0) > Predict: 0.29333 > Else (feature 0 > 0.0) >If (feature 1 <= 0.0) > Predict: 0.56 >Else (feature 1 > 0.0) > Predict: -0.29333 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960260#comment-15960260 ] Apache Spark commented on SPARK-16957: -- User 'facaiy' has created a pull request for this issue: https://github.com/apache/spark/pull/17556
[jira] [Assigned] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16957: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960177#comment-15960177 ] Yan Facai (颜发才) commented on SPARK-16957: - I think it is helpful for small datasets, though the effect is negligible for large ones. The task is easy; the question is whether it is needed. If the issue is shepherded, I'd like to work on it.
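The weighted-split rule described in the ticket above can be sketched as a small helper. This is only an illustration of the formula implied by the ticket's example (30 samples at {{x = 0}} and 10 at {{x = 1}} giving a split at {{0.75}}), not Spark's internal splitting code:

```python
def weighted_split(v_left, n_left, v_right, n_right):
    """Weighted split point between two adjacent bin values.

    The split lies between v_left and v_right, displaced from v_left in
    proportion to the left bin's sample count, matching the ticket's
    example: 30 samples at x = 0 and 10 samples at x = 1 give a split
    at 0.75, instead of the plain midpoint 0.5 or the raw bin edge 0.0.
    """
    return (n_left * v_right + n_right * v_left) / (n_left + n_right)


# Ticket's example: 30 samples at 0.0, 10 samples at 1.0.
print(weighted_split(0.0, 30, 1.0, 10))  # 0.75
# Equal counts recover the ordinary midpoint.
print(weighted_split(0.0, 10, 1.0, 10))  # 0.5
```

When the two bins hold equal mass this reduces to the unweighted midpoint, so the change only matters where bin populations are skewed.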
[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-20240: -- Affects Version/s: (was: 2.1.0) > SparkSQL support limitations of max dynamic partitions when inserting hive > table > > > Key: SPARK-20240 > URL: https://issues.apache.org/jira/browse/SPARK-20240 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 1.6.3 >Reporter: zenglinxi > > We found that an HDFS problem sometimes occurs when a user has a typo in their > code while using SparkSQL to insert data into a partitioned table. > For example: > create table: > {quote} > create table test_tb ( >price double > ) PARTITIONED BY (day_partition string, hour_partition string) > {quote} > normal sql for inserting into the table: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select price, day_partition, hour_partition from other_table; > {quote} > sql with a typo: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select hour_partition, day_partition, price from other_table; > {quote} > This typo makes SparkSQL take the column "price" as "hour_partition", which may > create millions of HDFS files in a short time if "other_table" holds a large amount of data > with a wide range of "price" values, giving rise to awful NameNode > RPC performance. > We think it's a good idea to limit the maximum number of files allowed to be > created by each task, to protect the HDFS NameNode from unintentional errors. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
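The safeguard proposed above — capping the number of dynamic partitions a single insert may create — can be sketched outside Spark as a guard over the distinct partition values. The function name and limit below are illustrative only, not an actual Spark or Hive configuration:

```python
def check_dynamic_partitions(partition_values, max_partitions):
    """Fail fast if an insert would create too many dynamic partitions.

    With the typo in the example SQL above, the continuous `price`
    column is fed in as a partition column, so the number of distinct
    values (and hence HDFS directories and files) explodes. A guard
    like this would reject the statement before it hammers the NameNode.
    """
    distinct = set(partition_values)
    if len(distinct) > max_partitions:
        raise ValueError(
            f"insert would create {len(distinct)} dynamic partitions, "
            f"but the limit is {max_partitions}"
        )
    return len(distinct)


# A real partition column has few distinct values; this passes.
check_dynamic_partitions(["2017-04-06"] * 100, max_partitions=1000)
# A continuous column mistakenly used as a partition column fails fast.
try:
    check_dynamic_partitions([i * 0.01 for i in range(5000)], max_partitions=1000)
except ValueError as e:
    print(e)
```

Hive exposes a comparable knob ({{hive.exec.max.dynamic.partitions}}); the ticket asks for an equivalent check in SparkSQL's insert path.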
[jira] [Commented] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960147#comment-15960147 ] Gowtham commented on SPARK-13210: - we are facing same exception in spark 2.1.0 java.lang.NullPointerException at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:347) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:171) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:171) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.6.1, 2.0.0 > > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org
[jira] [Commented] (SPARK-19495) Make SQLConf slightly more extensible
[ https://issues.apache.org/jira/browse/SPARK-19495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960004#comment-15960004 ] Apache Spark commented on SPARK-19495: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17555 > Make SQLConf slightly more extensible > - > > Key: SPARK-19495 > URL: https://issues.apache.org/jira/browse/SPARK-19495 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.0 > > > SQLConf cannot be registered by external libraries right now due to the > visibility limitations. The ticket proposes making them slightly more > extensible by removing those visibility limitations. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960001#comment-15960001 ] Liang-Chi Hsieh edited comment on SPARK-20226 at 4/6/17 11:59 PM: -- {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. It must be set through SQLConf. I am not sure if your local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? was (Author: viirya): {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. I am not sure if your local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? > Call to sqlContext.cacheTable takes an incredibly long time in some cases > - > > Key: SPARK-20226 > URL: https://issues.apache.org/jira/browse/SPARK-20226 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: linux or windows >Reporter: Barry Becker > Labels: cache > Attachments: profile_indexer2.PNG, xyzzy.csv > > > I have a case where the call to sqlContext.cacheTable can take an arbitrarily > long time depending on the number of columns that are referenced in a > withColumn expression applied to a dataframe. > The dataset is small (20 columns 7861 rows). The sequence to reproduce is the > following: > 1) add a new column that references 8 - 14 of the columns in the dataset. >- If I add 8 columns, then the call to cacheTable is fast - like *5 > seconds* >- If I add 11 columns, then it is slow - like *60 seconds* >- and if I add 14 columns, then it basically *takes forever* - I gave up > after 10 minutes or so. > The Column expression that is added, is basically just concatenating > the columns together in a single string. If a number is concatenated on a > string (or vice versa) the number is first converted to a string. 
> The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. > Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > 
{code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons > Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputCol":"Summons > Number_CLEANED__" >} >
[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960001#comment-15960001 ] Liang-Chi Hsieh commented on SPARK-20226: - {{spark.sql.constraintPropagation.enabled}} is a SQL config flag. I am not sure if your local.conf only covers Spark configuration via SparkConf. Can you explicitly set this flag in your application through SQLConf? > Call to sqlContext.cacheTable takes an incredibly long time in some cases > - > > Key: SPARK-20226 > URL: https://issues.apache.org/jira/browse/SPARK-20226 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: linux or windows >Reporter: Barry Becker > Labels: cache > Attachments: profile_indexer2.PNG, xyzzy.csv > > > I have a case where the call to sqlContext.cacheTable can take an arbitrarily > long time depending on the number of columns that are referenced in a > withColumn expression applied to a dataframe. > The dataset is small (20 columns 7861 rows). The sequence to reproduce is the > following: > 1) add a new column that references 8 - 14 of the columns in the dataset. >- If I add 8 columns, then the call to cacheTable is fast - like *5 > seconds* >- If I add 11 columns, then it is slow - like *60 seconds* >- and if I add 14 columns, then it basically *takes forever* - I gave up > after 10 minutes or so. > The Column expression that is added, is basically just concatenating > the columns together in a single string. If a number is concatenated on a > string (or vice versa) the number is first converted to a string. 
> The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. > Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > 
{code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons > Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputCol":"Summons > Number_CLEANED__" >} >}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333202079,"sparkVersion":"2.1.0", > "uid":"bucketizer_f5db4fb8120e", > "paramMap":{ > > "splits":["-Inf",1.435215616E9,1.443855616E9,1.447271936E9,1.448222464E9,1.448395264E9,1.448481536E9,1.448827136E9,1.449259264E9,1
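The flag Liang-Chi points at is a SQL-level setting scoped to SQLConf, so setting it only in a cluster-side conf file or SparkConf may not take effect for the session. A minimal sketch of setting it in-session — whether disabling constraint propagation actually resolves this slowdown is exactly what the comment asks to verify:

```sql
-- Session-level SQLConf setting (e.g. via beeline against the Thrift server);
-- hypothetical workaround to test, per the comment above.
SET spark.sql.constraintPropagation.enabled=false;
```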
[jira] [Updated] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-20246: --- Labels: correctness (was: ) > Should check determinism when pushing predicates down through aggregation > - > > Key: SPARK-20246 > URL: https://issues.apache.org/jira/browse/SPARK-20246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weiluo Ren > Labels: correctness > > {code}import org.apache.spark.sql.functions._ > spark.range(1,1000).distinct.withColumn("random", > rand()).filter(col("random") > 0.3).orderBy("random").show{code} > gives wrong result. > In the optimized logical plan, it shows that the filter with the > non-deterministic predicate is pushed beneath the aggregate operator, which > should not happen. > cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959946#comment-15959946 ] Cheng Lian commented on SPARK-20246: [This line|https://github.com/apache/spark/blob/a4491626ed8169f0162a0dfb78736c9b9e7fb434/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L795] should be the root cause. We didn't check determinism of the predicates before pushing them down. The same thing also applies when pushing predicates through union and window operators. cc [~cloud_fan] > Should check determinism when pushing predicates down through aggregation > - > > Key: SPARK-20246 > URL: https://issues.apache.org/jira/browse/SPARK-20246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weiluo Ren > > {code}import org.apache.spark.sql.functions._ > spark.range(1,1000).distinct.withColumn("random", > rand()).filter(col("random") > 0.3).orderBy("random").show{code} > gives wrong result. > In the optimized logical plan, it shows that the filter with the > non-deterministic predicate is pushed beneath the aggregate operator, which > should not happen. > cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
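The fix Cheng Lian describes — checking determinism before pushing predicates through an aggregate — can be illustrated outside Catalyst. This is a toy model under assumed names (`Predicate`, `pushable` are invented, not Spark's actual Optimizer classes): a predicate like `rand() > 0.3` is non-deterministic and must stay above the aggregate, while `id > 10` may be pushed below it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the optimizer decision: only deterministic predicates may be
// pushed below an aggregate. Evaluating rand() > 0.3 before the distinct
// would change which rows survive, which is the bug in the report above.
class PushdownDemo {
    static class Predicate {
        final String expr;
        final boolean deterministic;
        Predicate(String expr, boolean deterministic) {
            this.expr = expr;
            this.deterministic = deterministic;
        }
    }

    // Returns only the predicates that are safe to push through the aggregate.
    static List<Predicate> pushable(List<Predicate> predicates) {
        List<Predicate> result = new ArrayList<>();
        for (Predicate p : predicates) {
            if (p.deterministic) {
                result.add(p);
            }
        }
        return result;
    }
}
```

The same guard applies when pushing through union and window operators, as the comment notes.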
[jira] [Updated] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiluo Ren updated SPARK-20246: --- Description: `import org.apache.spark.sql.functions._ spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] was: `import org.apache.spark.sql.functions._` `spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] > Should check determinism when pushing predicates down through aggregation > - > > Key: SPARK-20246 > URL: https://issues.apache.org/jira/browse/SPARK-20246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weiluo Ren > > `import org.apache.spark.sql.functions._ > spark.range(1,1000).distinct.withColumn("random", > rand()).filter(col("random") > 0.3).orderBy("random").show` > gives wrong result. > In the optimized logical plan, it shows that the filter with the > non-deterministic predicate is pushed beneath the aggregate operator, which > should not happen. > cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiluo Ren updated SPARK-20246: --- Description: {code}import org.apache.spark.sql.functions._ spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show{code} gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] was: `import org.apache.spark.sql.functions._ spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] > Should check determinism when pushing predicates down through aggregation > - > > Key: SPARK-20246 > URL: https://issues.apache.org/jira/browse/SPARK-20246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weiluo Ren > > {code}import org.apache.spark.sql.functions._ > spark.range(1,1000).distinct.withColumn("random", > rand()).filter(col("random") > 0.3).orderBy("random").show{code} > gives wrong result. > In the optimized logical plan, it shows that the filter with the > non-deterministic predicate is pushed beneath the aggregate operator, which > should not happen. > cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiluo Ren updated SPARK-20246: --- Description: `import org.apache.spark.sql.functions._` `spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] was: ` import org.apache.spark.sql.functions._ spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show ` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] > Should check determinism when pushing predicates down through aggregation > - > > Key: SPARK-20246 > URL: https://issues.apache.org/jira/browse/SPARK-20246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weiluo Ren > > `import org.apache.spark.sql.functions._` > `spark.range(1,1000).distinct.withColumn("random", > rand()).filter(col("random") > 0.3).orderBy("random").show` > gives wrong result. > In the optimized logical plan, it shows that the filter with the > non-deterministic predicate is pushed beneath the aggregate operator, which > should not happen. > cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
Weiluo Ren created SPARK-20246: -- Summary: Should check determinism when pushing predicates down through aggregation Key: SPARK-20246 URL: https://issues.apache.org/jira/browse/SPARK-20246 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Weiluo Ren ` import org.apache.spark.sql.functions._ spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show ` gives wrong result. In the optimized logical plan, it shows that the filter with the non-deterministic predicate is pushed beneath the aggregate operator, which should not happen. cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race
[ https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Raducanu updated SPARK-20243: Description: Introduced by SPARK-19946. DebugFilesystem.assertNoOpenStreams gets the size of the openStreams ConcurrentHashMap and then later, if the size was > 0, accesses the first element in openStreams.values. But, the ConcurrentHashMap might be cleared by another thread between getting its size and accessing it, resulting in an exception when trying to call .head on an empty collection. was:Introduced by SPARK-19946 > DebugFilesystem.assertNoOpenStreams thread race > --- > > Key: SPARK-20243 > URL: https://issues.apache.org/jira/browse/SPARK-20243 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu > > Introduced by SPARK-19946. > DebugFilesystem.assertNoOpenStreams gets the size of the openStreams > ConcurrentHashMap and then later, if the size was > 0, accesses the first > element in openStreams.values. But, the ConcurrentHashMap might be cleared by > another thread between getting its size and accessing it, resulting in an > exception when trying to call .head on an empty collection.
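The race in the description — reading the map's size, then accessing a value while another thread may `clear()` the map in between — has a standard repair: take one snapshot of the values and work only with the snapshot. A plain-Java sketch; the class and method names are stand-ins, not Spark's actual `DebugFilesystem`:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the fixed assertion. The racy version did:
//   if (openStreams.size() > 0) { ... openStreams.values().head ... }
// which can fail if the map is cleared between the two calls. Copying the
// values once gives a single consistent view to both check and use.
class OpenStreamsCheck {
    static void assertNoOpenStreamsFixed(ConcurrentHashMap<Long, Throwable> openStreams)
            throws IOException {
        // One snapshot -- safe even if another thread clears the map now.
        List<Throwable> snapshot = new ArrayList<>(openStreams.values());
        if (!snapshot.isEmpty()) {
            throw new IOException(
                "There are " + snapshot.size() + " possibly leaked streams", snapshot.get(0));
        }
    }
}
```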
[jira] [Assigned] (SPARK-20026) Document R GLM Tweedie family support in programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20026: Assignee: (was: Apache Spark) > Document R GLM Tweedie family support in programming guide and code example > --- > > Key: SPARK-20026 > URL: https://issues.apache.org/jira/browse/SPARK-20026 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >
[jira] [Assigned] (SPARK-20026) Document R GLM Tweedie family support in programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20026: Assignee: Apache Spark > Document R GLM Tweedie family support in programming guide and code example > --- > > Key: SPARK-20026 > URL: https://issues.apache.org/jira/browse/SPARK-20026 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Apache Spark >
[jira] [Commented] (SPARK-20026) Document R GLM Tweedie family support in programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959527#comment-15959527 ] Apache Spark commented on SPARK-20026: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/17553 > Document R GLM Tweedie family support in programming guide and code example > --- > > Key: SPARK-20026 > URL: https://issues.apache.org/jira/browse/SPARK-20026 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >
[jira] [Commented] (SPARK-20241) java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribut
[ https://issues.apache.org/jira/browse/SPARK-20241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959495#comment-15959495 ] Kazuaki Ishizaki commented on SPARK-20241: -- I think this problem is caused by using different class loaders for {{org.codehaus.janino.util.ClassFile$CodeAttribute}} and {{org.codehaus.janino.util.ClassFile}}. I have created a patch that uses the same class loader, and am now writing a test case. > java.lang.IllegalArgumentException: Can not set final [B field > org.codehaus.janino.util.ClassFile$CodeAttribute.code to > org.codehaus.janino.util.ClassFile$CodeAttribute > > > Key: SPARK-20241 > URL: https://issues.apache.org/jira/browse/SPARK-20241 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: 翟玉勇 >Priority: Minor > > Running a SQL operation, for example {{insert into table mytable select * from > othertable}}, produces a WARN log: > {code} > 17/04/06 17:04:55 WARN > [org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator(87) -- > main]: Error calculating stats of compiled class. 
> java.lang.IllegalArgumentException: Can not set final [B field > org.codehaus.janino.util.ClassFile$CodeAttribute.code to > org.codehaus.janino.util.ClassFile$CodeAttribute > at > sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:164) > at > sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:168) > at > sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:55) > at > sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) > at java.lang.reflect.Field.get(Field.java:379) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:997) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:984) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:984) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:983) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:983) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:979) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:979) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:951) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1023) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1020) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) > at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) > at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalC
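The comment above attributes the exception to two class loaders each loading `ClassFile$CodeAttribute`: reflection requires the field's declaring class and the receiver's class to be the *same* `Class` object, and two loaders produce two distinct `Class` objects for the same name. A minimal sketch of that receiver check firing — the stand-in classes below simulate the mismatch without a second class loader, and the exact "Can not set final [B field" wording in the trace is a quirk of the JDK's internal accessor reusing the setter's message:

```java
import java.lang.reflect.Field;

// Two distinct classes stand in for "the same class loaded by two loaders".
class CodeAttributeA { byte[] code = new byte[] {1}; }
class CodeAttributeB { byte[] code = new byte[] {2}; }

class ReflectionMismatchDemo {
    // Field.get rejects a receiver whose class is not the field's declaring
    // class -- the same identity check that fails across class loaders.
    static boolean getRejectsForeignReceiver() throws Exception {
        Field f = CodeAttributeA.class.getDeclaredField("code");
        f.setAccessible(true);
        try {
            f.get(new CodeAttributeB()); // receiver is not a CodeAttributeA
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }
}
```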
[jira] [Assigned] (SPARK-17019) Expose off-heap memory usage in various places
[ https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-17019: Assignee: Saisai Shao > Expose off-heap memory usage in various places > -- > > Key: SPARK-17019 > URL: https://issues.apache.org/jira/browse/SPARK-17019 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.2.0 > > > With SPARK-13992, Spark supports persisting data into off-heap memory, but > the usage of off-heap is not exposed currently, it is not so convenient for > user to monitor and profile, so here propose to expose off-heap memory as > well as on-heap memory usage in various places: > 1. Spark UI's executor page will display both on-heap and off-heap memory > usage. > 2. REST request returns both on-heap and off-heap memory. > 3. Also these two memory usage can be obtained programmatically from > SparkListener. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17019) Expose off-heap memory usage in various places
[ https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-17019. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 14617 [https://github.com/apache/spark/pull/14617] > Expose off-heap memory usage in various places > -- > > Key: SPARK-17019 > URL: https://issues.apache.org/jira/browse/SPARK-17019 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Priority: Minor > Fix For: 2.2.0 > > > With SPARK-13992, Spark supports persisting data into off-heap memory, but > the usage of off-heap is not exposed currently, it is not so convenient for > user to monitor and profile, so here propose to expose off-heap memory as > well as on-heap memory usage in various places: > 1. Spark UI's executor page will display both on-heap and off-heap memory > usage. > 2. REST request returns both on-heap and off-heap memory. > 3. Also these two memory usage can be obtained programmatically from > SparkListener. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959368#comment-15959368 ] yuhao yang commented on SPARK-20082: Sorry I'm occupied by some internal project this week. I'll find some time to look into it this weekend or early next week. > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA.
[jira] [Commented] (SPARK-20060) Support Standalone visiting secured HDFS
[ https://issues.apache.org/jira/browse/SPARK-20060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959349#comment-15959349 ] Marcelo Vanzin commented on SPARK-20060: Adding a link to SPARK-5158; they're not exactly duplicates, but maybe they should be. In both cases, a spec should be written explaining what is being done, with an explanation of how it is secure. > Support Standalone visiting secured HDFS > - > > Key: SPARK-20060 > URL: https://issues.apache.org/jira/browse/SPARK-20060 > Project: Spark > Issue Type: New Feature > Components: Deploy, Spark Core >Affects Versions: 2.2.0 >Reporter: Kent Yao > > h1. Brief design > h2. Introductions > The basic issue for Standalone mode when visiting Kerberos-secured HDFS or other > kerberized services is how to gather the delegation tokens on the driver side > and deliver them to the executor side. > When we run Spark on YARN, the tokens are set on the container launch context > and delivered automatically; the long-running-application problem caused by token > expiration was fixed by SPARK-14743, which writes the tokens to HDFS and keeps > updating and renewing the credential file. > When running Spark on Standalone, we currently have no mechanism like YARN's > to obtain and deliver those tokens. > h2. Implementations > First, we move the implementation of SPARK-14743, which is currently > YARN-only, into the core module. We use it to gather the credentials we need, and > also to update and renew the credential files on HDFS. > Second, credential files on secured HDFS are not reachable by executors before > they obtain the tokens. Here we add a sequence configuration > `spark.deploy.credential.entities`, which the driver populates with > `token.encodeToUrlString()` before launching the executors; the executors fetch this > credential string sequence while fetching the driver-side Spark properties, and > then decode the strings back into tokens. 
> Before setting up the `CoarseGrainedExecutorBackend`, we set the credentials > on the current executor-side UGI.
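The delivery path the proposal describes — driver serializes tokens into a URL-safe string under a config key, executors read that key back while fetching the driver-side properties and decode it — can be sketched with a plain Base64 round-trip. In Spark this would use Hadoop's `Token.encodeToUrlString()` / `Token.decodeFromUrlString()`; the class and method names below are hypothetical illustrations of the mechanism, not the patch's actual code:

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Sketch of credential delivery via a property entry. Hadoop's
// Token.encodeToUrlString() does URL-safe Base64 of the token's bytes;
// plain Base64 here just illustrates the round trip.
class CredentialDelivery {
    // Config key named in the proposal above.
    static final String KEY = "spark.deploy.credential.entities";

    // Driver side: encode token bytes into the properties handed to executors.
    static Map<String, String> driverSideEncode(byte[] tokenBytes) {
        Map<String, String> props = new HashMap<>();
        props.put(KEY, Base64.getUrlEncoder().encodeToString(tokenBytes));
        return props;
    }

    // Executor side: recover the token bytes before setting up the backend.
    static byte[] executorSideDecode(Map<String, String> props) {
        return Base64.getUrlDecoder().decode(props.get(KEY));
    }
}
```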
[jira] [Assigned] (SPARK-20245) pass output to LogicalRelation directly
[ https://issues.apache.org/jira/browse/SPARK-20245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20245: Assignee: Wenchen Fan (was: Apache Spark) > pass output to LogicalRelation directly > --- > > Key: SPARK-20245 > URL: https://issues.apache.org/jira/browse/SPARK-20245 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Assigned] (SPARK-20245) pass output to LogicalRelation directly
[ https://issues.apache.org/jira/browse/SPARK-20245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20245: Assignee: Apache Spark (was: Wenchen Fan) > pass output to LogicalRelation directly > --- > > Key: SPARK-20245 > URL: https://issues.apache.org/jira/browse/SPARK-20245 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >
[jira] [Commented] (SPARK-20245) pass output to LogicalRelation directly
[ https://issues.apache.org/jira/browse/SPARK-20245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959333#comment-15959333 ] Apache Spark commented on SPARK-20245: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17552 > pass output to LogicalRelation directly > --- > > Key: SPARK-20245 > URL: https://issues.apache.org/jira/browse/SPARK-20245 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Created] (SPARK-20245) pass output to LogicalRelation directly
Wenchen Fan created SPARK-20245: --- Summary: pass output to LogicalRelation directly Key: SPARK-20245 URL: https://issues.apache.org/jira/browse/SPARK-20245 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Resolved] (SPARK-20195) SparkR to add createTable catalog API and deprecate createExternalTable
[ https://issues.apache.org/jira/browse/SPARK-20195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-20195. -- Resolution: Fixed Fix Version/s: 2.2.0 Target Version/s: 2.2.0 > SparkR to add createTable catalog API and deprecate createExternalTable > --- > > Key: SPARK-20195 > URL: https://issues.apache.org/jira/browse/SPARK-20195 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.2.0 > > > Only naming differences for clarity, functionality is already supported with > and without path parameter.
[jira] [Resolved] (SPARK-20196) Python to add catalog API for refreshByPath
[ https://issues.apache.org/jira/browse/SPARK-20196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-20196. -- Resolution: Fixed Fix Version/s: 2.2.0 Target Version/s: 2.2.0 > Python to add catalog API for refreshByPath > --- > > Key: SPARK-20196 > URL: https://issues.apache.org/jira/browse/SPARK-20196 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.2.0 > >
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959185#comment-15959185 ] Steven Ruppert commented on SPARK-19870: ok, I can get that. trace logging as in slf4j TRACE level on org.apache.spark, or something else? I suppose it's worth noting that I've seen my jobs hit the "Dropping SparkListenerEvent because no remaining room in event queue. " on the master, though I don't think I saw it at the same time as this deadlock. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > > > Key: SPARK-19870 > URL: https://issues.apache.org/jira/browse/SPARK-19870 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 2.0.2, 2.1.0 > Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, > yarn coarse-grained. >Reporter: Steven Ruppert > Attachments: stack.txt > > > Running what I believe to be a fairly vanilla spark job, using the RDD api, > with several shuffles, a cached RDD, and finally a conversion to DataFrame to > save to parquet. I get a repeatable deadlock at the very last reducers of one > of the stages. 
> Roughly: > {noformat} > "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 > tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry > [0x7fffb95f3000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207) > - waiting to lock <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x0005b12f2290> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > and > {noformat} > "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 > tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at > org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202) > - locked <0x000545736b58> (a > org.apache.spark.storage.BlockInfoManager) > at > 
org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210) > - locked <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x00059711eb10> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent
[jira] [Commented] (SPARK-20228) Random Forest unstable results depending on spark.executor.memory
[ https://issues.apache.org/jira/browse/SPARK-20228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959168#comment-15959168 ] Ansgar Schulze commented on SPARK-20228: I just varied spark.executor.memory; everything else is constant. And I performed 5 runs with each configuration. > Random Forest unstable results depending on spark.executor.memory > - > > Key: SPARK-20228 > URL: https://issues.apache.org/jira/browse/SPARK-20228 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Ansgar Schulze > > If I deploy a random forest model with, for example, > spark.executor.memory=20480M > I get a different result than if I deploy the model with > spark.executor.memory=6000M > I expected the same results but different runtimes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
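One plausible explanation, assuming the memory change alters how the data ends up partitioned: distributed samplers commonly draw a separate pseudo-random stream per partition, so the same seed over differently partitioned data yields different subsamples. A minimal, hypothetical pure-Python sketch of that effect (not Spark's actual sampler):

```python
import random

def sample_per_partition(data, n_parts, fraction=0.5, seed=42):
    # Each partition gets its own RNG stream (seed + partition id), as
    # distributed samplers commonly do, so which rows are sampled depends
    # on how rows are split across partitions -- not just on the seed.
    parts = [data[i::n_parts] for i in range(n_parts)]
    chosen = []
    for pid, part in enumerate(parts):
        rng = random.Random(seed + pid)
        chosen.extend(x for x in part if rng.random() < fraction)
    return sorted(chosen)

data = list(range(40))
a = sample_per_partition(data, n_parts=2)
b = sample_per_partition(data, n_parts=4)
# Same seed, same data -- a different partition count almost surely
# selects a different subsample, which would change a trained forest.
```

Under that assumption, identical results across executor-memory settings would require identical partitioning, not just an identical seed.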
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959169#comment-15959169 ] Eyal Farago commented on SPARK-19870: - Steven, I meant traces produced by logging. Sean, this looks like a missed notification: one thread holds a lock and waits (via Java's synchronized/wait/notify mechanism) for a notification and condition that never arrive, while a second thread attempts to synchronize on the same object. This code is annotated with lots of traces that can shed some light on the issue. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > [quoted issue description and stack traces snipped; identical to the first occurrence above]
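The missed-notification failure mode described in the comment above can be sketched without any Spark code: if notify() fires before the waiter calls wait(), and the waiter does not re-check its predicate in a loop, the wakeup is lost. A minimal, hypothetical Python illustration, using threading.Condition in place of Java's monitor:

```python
import threading

cond = threading.Condition()
ready = False  # the predicate the waiter is supposed to check

def notifier():
    global ready
    with cond:
        ready = True
        cond.notify()  # no thread is waiting yet, so this wakeup is simply lost

def broken_wait():
    # BUG: waits unconditionally instead of looping on the predicate.
    # The timeout exists only so this demo terminates; the real bug hangs forever.
    with cond:
        return cond.wait(timeout=0.2)  # returns False when it times out

def correct_wait():
    # Re-checks the predicate, so a notify() that fired earlier is harmless.
    with cond:
        while not ready:
            cond.wait(timeout=0.2)
        return True

notifier()              # the notification fires first...
missed = broken_wait()  # ...so the unconditional wait just times out (False)
ok = correct_wait()     # the predicate-checking wait proceeds immediately (True)
```

This is why Java's Object.wait documentation insists on waiting inside a loop guarded by the condition being waited for.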
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959167#comment-15959167 ] Steven Ruppert commented on SPARK-19870: It didn't seem to be a JVM-level deadlock; rather, the machinery in BlockInfoManager was such that it was impossible to make forward progress. I only vaguely recall looking into this, but it was something with readLocksByTask and writeLocksByTask not lining up. It's also possible that it's actually some sort of cluster-level deadlock, because I could kill an executor that was stuck in this condition, wait for the master to relaunch it (on YARN), and the executor would get stuck in what looked like the same place. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > [quoted issue description and stack traces snipped; identical to the first occurrence above]
[jira] [Commented] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race
[ https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959138#comment-15959138 ] Sean Owen commented on SPARK-20243: --- This needs detail to be a JIRA. > DebugFilesystem.assertNoOpenStreams thread race > --- > > Key: SPARK-20243 > URL: https://issues.apache.org/jira/browse/SPARK-20243 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu > > Introduced by SPARK-19946 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
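For context on why an assertNoOpenStreams-style check can race: the assertion reads a shared open-stream registry while another thread may not yet have finished closing its stream, so the check can fail spuriously even though every stream does eventually get closed. A tiny, hypothetical sketch of that check-then-act window (plain Python, not the actual DebugFilesystem code):

```python
import threading

open_streams = set()     # shared registry of streams not yet closed
lock = threading.Lock()

def open_stream(name):
    with lock:
        open_streams.add(name)

def close_stream(name):
    with lock:
        open_streams.discard(name)

def leaked_streams():
    # Snapshot of the registry; only meaningful once all workers have
    # finished closing -- checking it any earlier is the race.
    with lock:
        return set(open_streams)

open_stream("stream-1")
leaked_early = leaked_streams()   # another thread hasn't closed yet: non-empty
close_stream("stream-1")
leaked_late = leaked_streams()    # after the close completes: empty
```

The two snapshots differ only in *when* the check runs, which is exactly the kind of detail the report would need to spell out.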
[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959134#comment-15959134 ] Barry Becker commented on SPARK-20226: -- Yes. We are running through spark job-server, and local.conf is where all the spark options go (in a section called context-settings {...}). > Call to sqlContext.cacheTable takes an incredibly long time in some cases > - > > Key: SPARK-20226 > URL: https://issues.apache.org/jira/browse/SPARK-20226 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: linux or windows >Reporter: Barry Becker > Labels: cache > Attachments: profile_indexer2.PNG, xyzzy.csv > > > I have a case where the call to sqlContext.cacheTable can take an arbitrarily > long time depending on the number of columns that are referenced in a > withColumn expression applied to a dataframe. > The dataset is small (20 columns, 7861 rows). The sequence to reproduce is the > following: > 1) add a new column that references 8 to 14 of the columns in the dataset. >- If I add 8 columns, then the call to cacheTable is fast - like *5 > seconds* >- If I add 11 columns, then it is slow - like *60 seconds* >- and if I add 14 columns, then it basically *takes forever* - I gave up > after 10 minutes or so. > The Column expression that is added is basically just the concatenation of > the columns into a single string. If a number is concatenated with a > string (or vice versa), the number is first converted to a string. 
> The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. > Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > 
{code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons > Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputCol":"Summons > Number_CLEANED__" >} >}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333202079,"sparkVersion":"2.1.0", > "uid":"bucketizer_f5db4fb8120e", > "paramMap":{ > > "splits":["-Inf",1.435215616E9,1.443855616E9,1.447271936E9,1.448222464E9,1.448395264E9,1.448481536E9,1.448827136E9,1.449259264E9,1.449432064E9,1.449518336E9,"Inf"], > "handleInvalid":"keep","outputCol":
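The super-linear slowdown is consistent with the shape of the expression quoted above: each added column wraps the entire previous expression in another UDF call, so the tree's depth grows with the column count and the analyzer revisits ever-deeper subtrees. A hypothetical sketch of how that nesting accumulates (plain Python, not Catalyst):

```python
def nested_udf_expr(cols):
    # Mirrors the reported expression: UDF(UDF(...UDF(c1, c2)..., cN-1), cN),
    # where every new column wraps the whole expression built so far.
    expr = cols[0]
    for c in cols[1:]:
        expr = ("UDF", expr, c)
    return expr

def depth(expr):
    # Depth of the resulting expression tree: leaves (column names) count as 1.
    if not isinstance(expr, tuple):
        return 1
    return 1 + max(depth(child) for child in expr[1:])

cols8 = [f"c{i}" for i in range(8)]
cols14 = [f"c{i}" for i in range(14)]
# Depth grows linearly with the column count: 8 columns -> depth 8,
# 14 columns -> depth 14, and per-node analysis cost compounds with it.
```

By contrast, a flat n-ary combination (e.g. Spark SQL's built-in concat over all columns at once) keeps the tree depth constant regardless of how many columns are involved.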
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959131#comment-15959131 ] Sean Owen commented on SPARK-19870: --- That isn't a deadlock, in that only one thread is waiting on a lock. At least, this trace doesn't show one. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > [quoted issue description and stack traces snipped; identical to the first occurrence above]
[jira] [Commented] (SPARK-8696) Streaming API for Online LDA
[ https://issues.apache.org/jira/browse/SPARK-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959088#comment-15959088 ] Rene Richard commented on SPARK-8696: - Hello, we'd like to use Online LDA for something like change-point detection. We need access to the intermediate topic lists after each new batch is processed, so that we can see how the topics change over time. As far as I understand it, the current implementation of OnlineLDA in MLlib doesn't expose intermediate topic lists per mini-batch. Will the predictOn method give us access to topics as they evolve with new data? I am relatively new to Spark, but I find that having two APIs (spark.ml and spark.mllib) is a bit confusing. Will these be merged in the future? > Streaming API for Online LDA > > > Key: SPARK-8696 > URL: https://issues.apache.org/jira/browse/SPARK-8696 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Streaming LDA can be a natural extension of online LDA. > Yet for now we need to settle the implementation of LDA prediction, to > support the predictOn method in the streaming version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959072#comment-15959072 ] Liang-Chi Hsieh commented on SPARK-20226: - I am not sure what the job-server local.conf is. Does it really affect the Spark SQL conf? > Call to sqlContext.cacheTable takes an incredibly long time in some cases > [quoted issue description snipped; identical to the first occurrence above]
[jira] [Updated] (SPARK-20244) Incorrect input size in UI with pyspark
[ https://issues.apache.org/jira/browse/SPARK-20244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artur Sukhenko updated SPARK-20244: --- Attachment: pyspark_incorrect_inputsize.png sparkshell_correct_inputsize.png Spark-shell and pyspark UI screenshots. > Incorrect input size in UI with pyspark > --- > > Key: SPARK-20244 > URL: https://issues.apache.org/jira/browse/SPARK-20244 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.1.0 >Reporter: Artur Sukhenko >Priority: Minor > Attachments: pyspark_incorrect_inputsize.png, > sparkshell_correct_inputsize.png > > > In Spark UI (Details for Stage) Input Size is 64.0 KB when running in > PySparkShell. > Also it is incorrect in Tasks table: > 64.0 KB / 132120575 in pyspark > 252.0 MB / 132120575 in spark-shell > I will attach screenshots. > Reproduce steps: > Run this to generate big file (press Ctrl+C after 5-6 seconds) > $ yes > /tmp/yes.txt > $ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/ > $ ./bin/pyspark > {code} > Python 2.7.5 (default, Nov 6 2016, 00:28:07) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.5 (default, Nov 6 2016 00:28:07) > SparkSession available as 'spark'.{code} > >>> a = sc.textFile("/tmp/yes.txt") > >>> a.count() > Open Spark UI and check Stage 0. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20244) Incorrect input size in UI with pyspark
Artur Sukhenko created SPARK-20244: -- Summary: Incorrect input size in UI with pyspark Key: SPARK-20244 URL: https://issues.apache.org/jira/browse/SPARK-20244 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0, 2.0.0 Reporter: Artur Sukhenko Priority: Minor [description snipped; identical to the quoted copy in the update notice above] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959053#comment-15959053 ] Steven Ruppert commented on SPARK-19870: Should be attached as https://issues.apache.org/jira/secure/attachment/12856834/stack.txt . If you have any scenarios or configurations I can try, let me know. Otherwise I'll see if I can reduce this to a minimal test case. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > [quoted issue description and stack traces snipped; identical to the first occurrence above]
[jira] [Commented] (SPARK-19900) [Standalone] Master registers application again when driver relaunched
[ https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959052#comment-15959052 ] Sergey commented on SPARK-19900: This issue is not reproduced in Spark 2.1.0. > [Standalone] Master registers application again when driver relaunched > -- > > Key: SPARK-19900 > URL: https://issues.apache.org/jira/browse/SPARK-19900 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.6.2 > Environment: Centos 6.5, spark standalone >Reporter: Sergey >Priority: Critical > Labels: Spark, network, standalone, supervise > > I've found some problems when the node where the driver is running has an > unstable network. A situation is possible in which two identical applications are > running on the cluster. > *Steps to Reproduce:* > # prepare 3 nodes: one for the Spark master and two for the Spark workers > # submit an application with the parameter spark.driver.supervise = true > # go to the node where the driver is running (for example spark-worker-1) and > close port 7077 > {code} > # iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # wait more than 60 seconds > # look at the Spark master UI > There are two Spark applications and one driver. The new application is in the > WAITING state and the second application is in the RUNNING state. The driver is in > the RUNNING or RELAUNCHING state (it depends on the resources available, as I > understand it) and is launched on another node (for example spark-worker-2) > # open the port > {code} > # iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # look at the Spark UI again > There are no changes > In addition, if you look at the processes on node spark-worker-1 > {code} > # ps ax | grep spark > {code} > you will see that the old driver is still running! 
> *Spark master logs:* > {code} > 17/03/10 05:26:27 WARN Master: Removing > worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 > seconds > 17/03/10 05:26:27 INFO Master: Removing worker > worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039 > 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1 > 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0 > 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347- > 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on > worker worker-20170310052411-spark-worker-2-40473 > 17/03/10 05:26:35 INFO Master: Registering app TestApplication > 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID > app-20170310052635-0001 > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got status update for unknown executor > app-20170310052354-/1 > 17/03/10 05:31:07 WARN Master: Got status update for unknown executor > app-20170310052354-/0 > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. 
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-201703100522
[jira] [Created] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race
Bogdan Raducanu created SPARK-20243: --- Summary: DebugFilesystem.assertNoOpenStreams thread race Key: SPARK-20243 URL: https://issues.apache.org/jira/browse/SPARK-20243 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.2.0 Reporter: Bogdan Raducanu Introduced by SPARK-19946 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
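The report is terse, but an assertion like `assertNoOpenStreams` is prone to a check-then-act race if one thread inspects the collection of tracked streams while another thread is still opening or closing one. A minimal sketch of the safe pattern (illustrative only; the class and method names are hypothetical, not Spark's actual implementation):

```python
import threading

class StreamTracker:
    """Illustrative stand-in for a debug filesystem that tracks open streams."""

    def __init__(self):
        self._lock = threading.Lock()
        self._open = {}               # stream id -> creation site

    def opened(self, sid, site):
        with self._lock:
            self._open[sid] = site

    def closed(self, sid):
        with self._lock:
            self._open.pop(sid, None)

    def assert_no_open_streams(self):
        # Snapshot under the lock so a concurrent open/close cannot mutate
        # the collection while it is being inspected (the likely race here).
        with self._lock:
            leaked = dict(self._open)
        assert not leaked, f"open streams: {leaked}"

tracker = StreamTracker()
tracker.opened(1, "test A")
tracker.closed(1)
tracker.assert_no_open_streams()   # passes: everything was closed
```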
[jira] [Comment Edited] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959024#comment-15959024 ] Barry Becker edited comment on SPARK-20226 at 4/6/17 2:45 PM: -- I set spark.sql.constraintPropagation.enabled to false in job-server local.conf and tried again. It did not help. It still took about 2 minutes. Oddly, setting it to true seemed to make it worse. I did find something that did work though. If I simply call cache() on the dataframe after the add column (right after step 1 above) then it runs very quickly. The time spent in cacheTable goes from 60 seconds to 0.5 seconds. I don't understand why though. I thought calling cache would only help if there was branching, but the pipeline is linear isn't it? Here is what the query plan looks like in the call to cache the dataframe before transforming with the pipeline. {code} Project [Plate#6, State#7, License Type#8, Summons Number#9, Issue Date#10, Violation Time#11, Violation#12, Judgment Entry Date#13, Fine Amount#14, Penalty Amount#15, Interest Amount#16, Reduction Amount#17, Payment Amount#18, Amount Due#19, Precinct#20, County#21, Issuing Agency#22, Violation Status#23, cast(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Plate#6, State#7), License Type#8), Violation Time#11), Violation#12), UDF(Judgment Entry Date#13)), UDF(Issue Date#10)), UDF(Summons Number#9)), UDF(Fine Amount#14)), UDF(Penalty Amount#15)), UDF(Interest Amount#16)) as string) AS columnBasedOnManyCols#43] +- Relation[Plate#6,State#7,License Type#8,Summons Number#9,Issue Date#10,Violation Time#11,Violation#12,Judgment Entry Date#13,Fine Amount#14,Penalty Amount#15,Interest Amount#16,Reduction Amount#17,Payment Amount#18,Amount Due#19,Precinct#20,County#21,Issuing Agency#22,Violation Status#23] csv {code} Here is how the query plan now looks in the call to cacheTable after transforming with the pipeline. Looks fairly similar to what it was before, but now its fast. 
{code} SubqueryAlias foo123, `foo123` +- Project [Plate#236, State#237, License Type#238, Summons Number#239, Issue Date#240, Violation Time#241, Violation#242, Judgment Entry Date#243, Fine Amount#244, Penalty Amount#245, Interest Amount#246, Reduction Amount#247, Payment Amount#248, Amount Due#249, Precinct#250, County#251, Issuing Agency#252, Violation Status#253, columnBasedOnManyCols#254, Penalty Amount (predicted)#2476] +- Project [Plate#236, Plate_CLEANED__#275, State#237, State_CLEANED__#276, License Type#238, License Type_CLEANED__#277, Summons Number_CLEANED__#362, Summons Number#239, Issue Date#240, Issue Date_CLEANED__#323, Violation Time#241, Violation Time_CLEANED__#279, Violation#242, Violation_CLEANED__#280, Judgment Entry Date#243, Judgment Entry Date_CLEANED__#324, Fine Amount#244, Fine Amount_CLEANED__#325, Penalty Amount#245, Penalty Amount_CLEANED__#326, Interest Amount_CLEANED__#363, Interest Amount#246, Reduction Amount_CLEANED__#364, Reduction Amount#247, ... 33 more fields] +- Project [Plate#236, Plate_CLEANED__#275, State#237, State_CLEANED__#276, License Type#238, License Type_CLEANED__#277, Summons Number_CLEANED__#362, Summons Number#239, Issue Date#240, Issue Date_CLEANED__#323, Violation Time#241, Violation Time_CLEANED__#279, Violation#242, Violation_CLEANED__#280, Judgment Entry Date#243, Judgment Entry Date_CLEANED__#324, Fine Amount#244, Fine Amount_CLEANED__#325, Penalty Amount#245, Penalty Amount_CLEANED__#326, Interest Amount_CLEANED__#363, Interest Amount#246, Reduction Amount_CLEANED__#364, Reduction Amount#247, ... 
33 more fields] +- SubqueryAlias sql_1ea4c1b5c52e_cd062499a688, `sql_1ea4c1b5c52e_cd062499a688` +- Project [Plate#236, Plate_CLEANED__#275, State#237, State_CLEANED__#276, License Type#238, License Type_CLEANED__#277, Summons Number_CLEANED__#362, Summons Number#239, Issue Date#240, Issue Date_CLEANED__#323, Violation Time#241, Violation Time_CLEANED__#279, Violation#242, Violation_CLEANED__#280, Judgment Entry Date#243, Judgment Entry Date_CLEANED__#324, Fine Amount#244, Fine Amount_CLEANED__#325, Penalty Amount#245, Penalty Amount_CLEANED__#326, Interest Amount_CLEANED__#363, Interest Amount#246, Reduction Amount_CLEANED__#364, Reduction Amount#247, ... 32 more fields] +- Project [Plate#236, Plate_CLEANED__#275, State#237, State_CLEANED__#276, License Type#238, License Type_CLEANED__#277, Summons Number_CLEANED__#362, Summons Number#239, Issue Date#240, Issue Date_CLEANED__#323, Violation Time#241, Violation Time_CLEANED__#279, Violation#242, Violation_CLEANED__#280, Judgment Entry Date#243, Judgment Entry Date_CLEANED__#324, Fine Amount#244, Fine Amount_CLEANED__#325, Penalty Amount#245, Penalty Amount_CLEANED__#326, Interest Amount_CLEANED__#363, Interest Amount#246, Reduction Amount_CLEANED__#364, Reduction Amount#247, ... 31 more fields
[jira] [Resolved] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,W
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artur Sukhenko resolved SPARK-19068. Resolution: Duplicate Fix Version/s: 2.2.0 Closing as duplicate of SPARK-19146. Messages like : {code}ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(136,WrappedArray()){code} are the result of problem discussed in SPARK-19146. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Fix For: 2.2.0 > > Attachments: sparklog.tar.gz > > > On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in > order to use all RAM and cores for a 100TB Spark SQL workload. Long-running > queries tend to report the following ERRORs > {noformat} > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(136,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(853,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(395,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(736,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! 
Dropping event > SparkListenerExecutorMetricsUpdate(439,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(16,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(307,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(51,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(535,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(63,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(333,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(484,WrappedArray()) > (omitted) > {noformat} > The message itself maybe a reasonable response to a already stopped > SparkListenerBus (so subsequent events are thrown away with that ERROR > message). The issue is that because SparkContext does NOT exit until all > these ERROR/events are reported, which is a huge number in our setup -- and > this can take, in some cases, hours!!! > We tried increasing the > Adding default property: spark.scheduler.listenerbus.eventqueue.size=13 > from 10K, this still occurs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetri
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rostyslav Sotnychenko updated SPARK-19068: -- Comment: was deleted (was: https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is also called on each _TaskEnd_ event. My idea is to call _LinkedHashMap.drop_ less often - will submit PR soon.) > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in > order to use all RAM and cores for a 100TB Spark SQL workload. Long-running > queries tend to report the following ERRORs > {noformat} > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(136,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(853,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(395,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(736,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! 
Dropping event > SparkListenerExecutorMetricsUpdate(439,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(16,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(307,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(51,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(535,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(63,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(333,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(484,WrappedArray()) > (omitted) > {noformat} > The message itself maybe a reasonable response to a already stopped > SparkListenerBus (so subsequent events are thrown away with that ERROR > message). The issue is that because SparkContext does NOT exit until all > these ERROR/events are reported, which is a huge number in our setup -- and > this can take, in some cases, hours!!! > We tried increasing the > Adding default property: spark.scheduler.listenerbus.eventqueue.size=13 > from 10K, this still occurs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958956#comment-15958956 ] Rostyslav Sotnychenko commented on SPARK-19068: --- Actually, maybe we should close this one as DUPLICATE? The same thing got fixed in SPARK-19146. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in > order to use all RAM and cores for a 100TB Spark SQL workload. Long-running > queries tend to report the following ERRORs > {noformat} > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(136,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(853,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(395,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(736,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(439,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! 
Dropping event > SparkListenerExecutorMetricsUpdate(16,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(307,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(51,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(535,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(63,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(333,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(484,WrappedArray()) > (omitted) > {noformat} > The message itself maybe a reasonable response to a already stopped > SparkListenerBus (so subsequent events are thrown away with that ERROR > message). The issue is that because SparkContext does NOT exit until all > these ERROR/events are reported, which is a huge number in our setup -- and > this can take, in some cases, hours!!! > We tried increasing the > Adding default property: spark.scheduler.listenerbus.eventqueue.size=13 > from 10K, this still occurs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20227) Job hangs when joining a lot of aggregated columns
[ https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Quentin Auge updated SPARK-20227: - Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge (was: AWS emr-5.4.0 m4.xlarge) > Job hangs when joining a lot of aggregated columns > -- > > Key: SPARK-20227 > URL: https://issues.apache.org/jira/browse/SPARK-20227 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge >Reporter: Quentin Auge > > I'm trying to replace a lot of different columns in a dataframe with > aggregates of themselves, and then join the resulting dataframe. > {code} > # Create a dataframe with 1 row and 50 columns > n = 50 > df = sc.parallelize([Row(*range(n))]).toDF() > cols = df.columns > # Replace each column values with aggregated values > window = Window.partitionBy(cols[0]) > for col in cols[1:]: > df = df.withColumn(col, sum(col).over(window)) > # Join > other_df = sc.parallelize([Row(0)]).toDF() > result = other_df.join(df, on = cols[0]) > result.show() > {code} > Spark hangs forever when executing the last line. The strange thing is, it > depends on the number of columns. Spark does not hang for n = 5, 10, or 20 > columns. For n = 50 and beyond, it does. > {code} > 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because > it has been idle for 60 seconds (new desired total will be 0) > 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling > executor 1. > 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0) > 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor > 1 from BlockManagerMaster. 
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager > BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None) > 17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in > removeExecutor > 17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on > ip-172-30-0-149.ec2.internal killed by driver. > 17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has > been removed (new total is 0) > {code} > All executors are inactive and thus killed after 60 seconds, the master > spends some CPU on a process that hangs indefinitely, and the workers are > idle. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdat
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958880#comment-15958880 ] Rostyslav Sotnychenko edited comment on SPARK-19068 at 4/6/17 1:14 PM: --- https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is also called on each _TaskEnd_ event. My idea is to call _LinkedHashMap.drop_ less often - will submit PR soon. was (Author: sota): https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is also called on each _TaskEnd_ event. My idea is to call it less often - will submit PR soon. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in > order to use all RAM and cores for a 100TB Spark SQL workload. Long-running > queries tend to report the following ERRORs > {noformat} > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(136,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! 
Dropping event > SparkListenerExecutorMetricsUpdate(853,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(395,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(736,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(439,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(16,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(307,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(51,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(535,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(63,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(333,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(484,WrappedArray()) > (omitted) > {noformat} > The message itself maybe a reasonable response to a already stopped > SparkListenerBus (so subsequent events are thrown away with that ERROR > message). 
The issue is that SparkContext does NOT exit until all > these ERROR events are reported, which is a huge number in our setup, and > this can take, in some cases, hours. > We tried increasing the event queue size (Adding default property: > spark.scheduler.listenerbus.eventqueue.size=13) from the default 10K, but this still occurs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
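The "call _LinkedHashMap.drop_ less often" idea discussed in the comment above can be sketched as batched trimming. This is an illustrative Python sketch only (the real structure is a Scala LinkedHashMap inside JobProgressListener; all names below are made up for the example):

```python
from collections import OrderedDict

class RetainedTaskTable:
    """Insertion-ordered task table bounded at `limit`.

    Rather than evicting an entry on every single TaskEnd event (the
    per-event drop the comment describes), the table is allowed to
    overshoot by `batch` entries and the whole surplus is then evicted
    at once, so the trim runs once per `batch` events instead of once
    per event.
    """

    def __init__(self, limit, batch):
        self.limit = limit
        self.batch = batch
        self.tasks = OrderedDict()

    def on_task_end(self, task_id, info):
        self.tasks[task_id] = info
        # Trim only once the overshoot threshold is crossed.
        if len(self.tasks) > self.limit + self.batch:
            while len(self.tasks) > self.limit:
                self.tasks.popitem(last=False)  # drop the oldest entry
```

The trade-off is slightly higher peak memory (up to `limit + batch` retained tasks) in exchange for far fewer O(n) trim passes under heavy TaskEnd traffic.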
[jira] [Commented] (SPARK-19802) Remote History Server
[ https://issues.apache.org/jira/browse/SPARK-19802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958896#comment-15958896 ] Ben Barnard commented on SPARK-19802: - The scenario we're trying to support is running Spark applications on a cluster without HDFS, with a scheduler such as Nomad or Kubernetes. Spark applications are run by the scheduler without the need for a Spark master, but we'd like a straightforward mechanism for the applications to be able to opt in to publishing to a history server. Yes, we were thinking of implementing an alternative `ApplicationHistoryProvider` to receive events in the history server, and something like `EventLoggingListener` in the driver that would send events to the server. We recognise that some refactoring may be required, but we are prepared to contribute this. > Remote History Server > - > > Key: SPARK-19802 > URL: https://issues.apache.org/jira/browse/SPARK-19802 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Ben Barnard > > Currently the history server expects to find history in a filesystem > somewhere. It would be nice to have a history server that listens for > application events on a TCP port, and an EventLoggingListener that sends > events to the listening history server instead of writing to a file. This > would allow the history server to show up-to-date history for past and > running jobs in a cluster environment that lacks a shared filesystem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
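The proposal above can be illustrated with a toy sketch: a "history server" that accepts newline-delimited JSON events over TCP, and a driver-side sender standing in for an EventLoggingListener that writes to a socket instead of a file. Everything here is hypothetical scaffolding to show the shape of the idea, not actual Spark API:

```python
import json
import socket
import socketserver
import threading

received = []  # events the toy history server has accepted

class EventHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One JSON-encoded event per line, as an event log file would hold.
        for line in self.rfile:
            received.append(json.loads(line))

def start_history_server(port=0):
    """Bind a TCP listener (ephemeral port by default) and serve in a
    background thread, standing in for a remote history server."""
    server = socketserver.TCPServer(("127.0.0.1", port), EventHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

class RemoteEventLogger:
    """Driver-side stand-in for EventLoggingListener: ships each event to
    the listening server instead of appending to a shared filesystem."""

    def __init__(self, address):
        self.sock = socket.create_connection(address)

    def on_event(self, event):
        self.sock.sendall((json.dumps(event) + "\n").encode())

    def close(self):
        self.sock.close()
```

Usage would look like `logger = RemoteEventLogger(server.server_address)` followed by `logger.on_event({...})`; a real implementation would of course need batching, reconnection, and authentication, which is where the refactoring mentioned in the comment comes in.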
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958894#comment-15958894 ] Eyal Farago commented on SPARK-19870: - [~stevenruppert] can you attach traces? [~joshrosen], a glance at the torrent broadcast code shows that it explicitly [releases the read-lock|https://github.com/apache/spark/blob/d009fb369bbea0df81bbcf9c8028d14cfcaa683b/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L214] on the block even though the [iterator obtained from the BlockResult|https://github.com/apache/spark/blob/d009fb369bbea0df81bbcf9c8028d14cfcaa683b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L497] is a completion iterator that releases locks (read lock in this scenario) upon completion. is it possible that the torrent code 'double-releases' the read lock on the relevant block putting it in some inconsistent state that prevents the next read from obtaining the lock (or somehow not notify on release)? [~joshrosen], seems at least part of this is related to your work on SPARK-12757, hence tagging you. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > > > Key: SPARK-19870 > URL: https://issues.apache.org/jira/browse/SPARK-19870 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 2.0.2, 2.1.0 > Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, > yarn coarse-grained. >Reporter: Steven Ruppert > Attachments: stack.txt > > > Running what I believe to be a fairly vanilla spark job, using the RDD api, > with several shuffles, a cached RDD, and finally a conversion to DataFrame to > save to parquet. I get a repeatable deadlock at the very last reducers of one > of the stages. 
> Roughly: > {noformat} > "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 > tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry > [0x7fffb95f3000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207) > - waiting to lock <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x0005b12f2290> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > and > {noformat} > "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 > tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at > org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202) > - locked <0x000545736b58> (a > org.apache.spark.storage.BlockInfoManager) > at > 
org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210) > - locked <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x00059711eb10> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apa
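The "double release" hypothesis in the comment above can be illustrated with a toy reader-count lock. This is purely illustrative Python (the real logic is Scala in Spark's BlockInfoManager): if any code path releases a read lock it no longer holds, the reader count goes negative, and anything waiting for the count to reach zero can wait forever.

```python
import threading

class ToyBlockReadLock:
    """Minimal reader-count lock for one block.

    `release_read_lock` does not check that a lock is actually held, so a
    double release drives the count negative, breaking the invariant that
    readers == 0 means the block is free."""

    def __init__(self):
        self.cond = threading.Condition()
        self.readers = 0

    def lock_for_reading(self):
        with self.cond:
            self.readers += 1

    def release_read_lock(self):
        with self.cond:
            self.readers -= 1  # no guard against releasing an unheld lock
            self.cond.notify_all()

    def writer_can_proceed(self):
        with self.cond:
            return self.readers == 0

lock = ToyBlockReadLock()
lock.lock_for_reading()
lock.release_read_lock()  # legitimate release (e.g. by a completion iterator)
lock.release_read_lock()  # second, explicit release: count is now -1
```

After the second release the count is -1, so `writer_can_proceed()` is permanently false: a caricature of the inconsistent state the comment suspects the torrent code's explicit unlock plus the completion iterator's unlock could produce.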
[jira] [Commented] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958880#comment-15958880 ] Rostyslav Sotnychenko commented on SPARK-19068: --- https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is called also on each _TaskEnd_ event. My idea is to call it less often - will submit PR soon. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > (issue description and error log omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdat
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958880#comment-15958880 ] Rostyslav Sotnychenko edited comment on SPARK-19068 at 4/6/17 1:00 PM: --- https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is also called on each _TaskEnd_ event. My idea is to call it less often - will submit PR soon. was (Author: sota): https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L406 Method _onTaskEnd_ is synchronized, so it is being executed only in one thread simultaneously. This code is triggered on each _TaskEnd_ event and that's why _LinkedHashMap.drop_ is called also on each _TaskEnd_ event. My idea is to call it less often - will submit PR soon. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > (issue description and error log omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20242) Delay stop of web UI
[ https://issues.apache.org/jira/browse/SPARK-20242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20242: Assignee: Apache Spark > Delay stop of web UI > > > Key: SPARK-20242 > URL: https://issues.apache.org/jira/browse/SPARK-20242 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Ben Barnard >Assignee: Apache Spark > > When debugging Spark applications or deployments, it is often useful to > inspect the Web UI of the driver application. Especially when the application > is running remotely and facilities such as the history server aren't > available, it can be limiting and frustrating that the UI stops when the > application finishes. Developers sometimes add sleep statements to the end of > their programs to allow this, but this requires rebuilding an application > to run it again with a delay. > It is useful to have a configuration property, such as spark.ui.stopDelay, > that can be used with arbitrary applications to prevent the UI from stopping > immediately. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20242) Delay stop of web UI
[ https://issues.apache.org/jira/browse/SPARK-20242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958872#comment-15958872 ] Apache Spark commented on SPARK-20242: -- User 'barnardb' has created a pull request for this issue: https://github.com/apache/spark/pull/17551 > Delay stop of web UI > > > Key: SPARK-20242 > URL: https://issues.apache.org/jira/browse/SPARK-20242 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Ben Barnard > > (issue description omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20242) Delay stop of web UI
[ https://issues.apache.org/jira/browse/SPARK-20242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20242: Assignee: (was: Apache Spark) > Delay stop of web UI > > > Key: SPARK-20242 > URL: https://issues.apache.org/jira/browse/SPARK-20242 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Ben Barnard > > (issue description omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958846#comment-15958846 ] Rostyslav Sotnychenko commented on SPARK-19068: --- As a workaround, one can raise the value of "spark.ui.retainedTasks" to the number of tasks the job will have, or higher. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > (issue description and error log omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
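The workaround above is set like any other Spark configuration property. The task count below is an arbitrary illustration; in practice it should come from the job's actual number of tasks:

```properties
# spark-defaults.conf, or passed as --conf to spark-submit
spark.ui.retainedTasks   200000
```

Note this trades driver memory (more retained task rows in the UI state) for avoiding the per-event trimming behaviour discussed in the earlier comments.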
[jira] [Created] (SPARK-20242) Delay stop of web UI
Ben Barnard created SPARK-20242: --- Summary: Delay stop of web UI Key: SPARK-20242 URL: https://issues.apache.org/jira/browse/SPARK-20242 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 2.1.0 Reporter: Ben Barnard When debugging Spark applications or deployments, it is often useful to inspect the Web UI of the driver application. Especially when the application is running remotely and facilities such as the history server aren't available, it can be limiting and frustrating that the UI stops when the application finishes. Developers sometimes add sleep statements to the end of their programs to allow this, but this requires rebuilding an application to run it again with a delay. It is useful to have a configuration property, such as spark.ui.stopDelay, that can be used with arbitrary applications to prevent the UI from stopping immediately. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
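The property named in the ticket is only a proposal at this point; the snippet below illustrates how it might be used if adopted, and is not an existing Spark setting:

```properties
# Proposed in SPARK-20242 (not an existing Spark property): keep the
# driver's web UI alive for a while after the application finishes,
# without adding sleep statements to the application code.
spark.ui.stopDelay   5m
```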
[jira] [Commented] (SPARK-20242) Delay stop of web UI
[ https://issues.apache.org/jira/browse/SPARK-20242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958804#comment-15958804 ] Ben Barnard commented on SPARK-20242: - I will submit a pull request for this. > Delay stop of web UI > > > Key: SPARK-20242 > URL: https://issues.apache.org/jira/browse/SPARK-20242 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Ben Barnard > > (issue description omitted; quoted in full elsewhere in this digest) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958801#comment-15958801 ] Apache Spark commented on SPARK-20240: -- User 'zenglinxi0615' has created a pull request for this issue: https://github.com/apache/spark/pull/17550 > SparkSQL support limitations of max dynamic partitions when inserting hive > table > > > Key: SPARK-20240 > URL: https://issues.apache.org/jira/browse/SPARK-20240 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 1.6.3, 2.1.0 >Reporter: zenglinxi > > We found that an HDFS problem sometimes occurs when a user has a typo in their > code while using SparkSQL to insert data into a partitioned table. > For example: > create table: > {quote} > create table test_tb ( >price double > ) PARTITIONED BY (day_partition string, hour_partition string) > {quote} > normal sql for inserting into the table: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select price, day_partition, hour_partition from other_table; > {quote} > sql with a typo: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select hour_partition, day_partition, price from other_table; > {quote} > This typo makes SparkSQL treat column "price" as "hour_partition", which may > create millions of HDFS files in a short time if "other_table" has large data > with a wide range of "price" values, and badly degrade NameNode > RPC performance. > We think it's a good idea to limit the maximum number of files allowed to be > created by each task, to protect the HDFS NameNode from inadvertent errors. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
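For comparison, Hive itself already enforces guards of this kind on dynamic-partition inserts; the proposal above would give SparkSQL analogous protection. A sketch of Hive's settings (names are real Hive properties; the values shown are illustrative, roughly Hive's documented defaults, and should be checked against the Hive version in use):

```properties
# Maximum dynamic partitions a single statement may create, cluster-wide
hive.exec.max.dynamic.partitions=1000
# Maximum dynamic partitions per mapper/reducer node
hive.exec.max.dynamic.partitions.pernode=100
# Maximum HDFS files a single MapReduce job may create
hive.exec.max.created.files=100000
```

With limits like these, a swapped-column typo fails the statement quickly instead of flooding the NameNode with file creations.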
[jira] [Assigned] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20240:
------------------------------------

    Assignee: Apache Spark

> SparkSQL support limitations of max dynamic partitions when inserting hive
> table
> --------------------------------------------------------------------------
>
>                 Key: SPARK-20240
>                 URL: https://issues.apache.org/jira/browse/SPARK-20240
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.2, 1.6.3, 2.1.0
>            Reporter: zenglinxi
>            Assignee: Apache Spark
>
> We found that an HDFS problem sometimes occurs when a user has a typo in
> their code while using SparkSQL to insert data into a partitioned table.
> For example:
> create table:
> {quote}
> create table test_tb (
>   price double
> ) PARTITIONED BY (day_partition string, hour_partition string)
> {quote}
> normal SQL for inserting into the table:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select price, day_partition, hour_partition from other_table;
> {quote}
> SQL with a typo:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select hour_partition, day_partition, price from other_table;
> {quote}
> This typo makes SparkSQL treat the column "price" as "hour_partition",
> which may create millions of HDFS files in a short time if "other_table"
> holds a large amount of data with a wide range of "price" values, badly
> degrading NameNode RPC performance.
> We think it is a good idea to limit the maximum number of files each task
> is allowed to create, to protect the HDFS NameNode from inadvertent errors.
[jira] [Assigned] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20240:
------------------------------------

    Assignee: (was: Apache Spark)

> SparkSQL support limitations of max dynamic partitions when inserting hive
> table
> --------------------------------------------------------------------------
>
>                 Key: SPARK-20240
>                 URL: https://issues.apache.org/jira/browse/SPARK-20240
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.2, 1.6.3, 2.1.0
>            Reporter: zenglinxi
>
> We found that an HDFS problem sometimes occurs when a user has a typo in
> their code while using SparkSQL to insert data into a partitioned table.
> For example:
> create table:
> {quote}
> create table test_tb (
>   price double
> ) PARTITIONED BY (day_partition string, hour_partition string)
> {quote}
> normal SQL for inserting into the table:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select price, day_partition, hour_partition from other_table;
> {quote}
> SQL with a typo:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select hour_partition, day_partition, price from other_table;
> {quote}
> This typo makes SparkSQL treat the column "price" as "hour_partition",
> which may create millions of HDFS files in a short time if "other_table"
> holds a large amount of data with a wide range of "price" values, badly
> degrading NameNode RPC performance.
> We think it is a good idea to limit the maximum number of files each task
> is allowed to create, to protect the HDFS NameNode from inadvertent errors.
[jira] [Commented] (SPARK-20228) Random Forest instable results depending on spark.executor.memory
[ https://issues.apache.org/jira/browse/SPARK-20228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958755#comment-15958755 ] Sean Owen commented on SPARK-20228:
-----------------------------------

I don't think there's a problem here. I'm saying that if you vary these
parameters you might legitimately get better or worse results. I'm also
asking whether you are varying these other things. Also, is this consistent,
or just on one run?

> Random Forest instable results depending on spark.executor.memory
> -----------------------------------------------------------------
>
>                 Key: SPARK-20228
>                 URL: https://issues.apache.org/jira/browse/SPARK-20228
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Ansgar Schulze
>
> If I deploy a random forest model with, for example,
> spark.executor.memory=20480M, I get a different result than if I deploy the
> model with spark.executor.memory=6000M.
> I expected the same results but different runtimes.
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958754#comment-15958754 ] Mathieu D commented on SPARK-20082:
-----------------------------------

[~yuhaoyan] or [~josephkb], any feedback on this approach and PR?

> Incremental update of LDA model, by adding initialModel as start point
> ----------------------------------------------------------------------
>
>                 Key: SPARK-20082
>                 URL: https://issues.apache.org/jira/browse/SPARK-20082
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Mathieu D
>
> Some MLlib models support an initialModel to start from and update
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to
> incrementally update an existing model with batches of new documents.
> I suggest adding an initialModel as a starting point for LDA.
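The warm-start pattern being requested here is generic. A minimal sketch of it (the names and the toy counting "model" below are hypothetical, not MLlib's LDA API): fitting optionally continues from a previous model's state instead of starting from scratch.

```python
def fit_incremental(batch, initial_model=None):
    """Sketch of warm-start training: seed the new fit with the state of a
    previous model, then fold in the new batch of data."""
    counts = dict(initial_model) if initial_model else {}
    for token in batch:
        counts[token] = counts.get(token, 0) + 1
    return counts

m1 = fit_incremental(["a", "b", "a"])                  # train from scratch
m2 = fit_incremental(["b", "c"], initial_model=m1)     # continue from m1
```

For LDA the state would be the topic-word statistics maintained by the optimizer rather than raw counts, but the shape of the API (an optional initialModel parameter) is the same.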
[jira] [Commented] (SPARK-20228) Random Forest instable results depending on spark.executor.memory
[ https://issues.apache.org/jira/browse/SPARK-20228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958719#comment-15958719 ] Ansgar Schulze commented on SPARK-20228:
----------------------------------------

Shouldn't there be a WARN message if there is a memory issue? There is one
if I change the maxMemoryInMB parameter to a low value, but not in this
situation.

> Random Forest instable results depending on spark.executor.memory
> -----------------------------------------------------------------
>
>                 Key: SPARK-20228
>                 URL: https://issues.apache.org/jira/browse/SPARK-20228
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Ansgar Schulze
>
> If I deploy a random forest model with, for example,
> spark.executor.memory=20480M, I get a different result than if I deploy the
> model with spark.executor.memory=6000M.
> I expected the same results but different runtimes.
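One plausible source of such run-to-run differences (an assumption offered here, not a confirmed diagnosis of this ticket) is that memory settings change partitioning and therefore the order in which floating-point values are combined. Floating-point addition is not associative, so different groupings of the same values can give slightly different sums:

```python
import math

vals = [0.1] * 10
left_to_right = sum(vals)   # plain left-to-right summation
exact = math.fsum(vals)     # correctly rounded sum of the same values

# The two summation orders disagree in the last bits; a difference that
# small can still flip tie-breaks during tree training.
print(left_to_right, exact)
```

Since no out-of-memory condition occurs, such numerical differences would also explain why no WARN message is emitted.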
[jira] [Updated] (SPARK-20241) java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribute
[ https://issues.apache.org/jira/browse/SPARK-20241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 翟玉勇 updated SPARK-20241:
------------------------

    Description:

Running a SQL operation, for example "insert into table mytable select * from othertable", produces the following WARN log:
{code}
17/04/06 17:04:55 WARN [org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator(87) -- main]: Error calculating stats of compiled class.
java.lang.IllegalArgumentException: Can not set final [B field org.codehaus.janino.util.ClassFile$CodeAttribute.code to org.codehaus.janino.util.ClassFile$CodeAttribute
	at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:164)
	at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:168)
	at sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:55)
	at sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
	at java.lang.reflect.Field.get(Field.java:379)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:997)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4$$anonfun$apply$5.apply(CodeGenerator.scala:984)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:984)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1$$anonfun$apply$4.apply(CodeGenerator.scala:983)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:983)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:979)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:979)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:951)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1023)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1020)
	at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
	at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
	at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
	at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
	at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
	at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:905)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:357)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:310)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(I