[jira] [Updated] (SPARK-3349) Incorrect partitioning after LIMIT operator
[ https://issues.apache.org/jira/browse/SPARK-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3349: --- Description: Reproduced by the following example:
{code}
import org.apache.spark.sql.catalyst.plans.Inner
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions._
val series = sql("select distinct year from sales order by year asc limit 10")
val results = sql("select * from sales")
series.registerTempTable("series")
results.registerTempTable("results")
sql("select * from results inner join series where results.year = series.year").count
---
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:56) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:189) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:189) at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:298) at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:310) at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:246) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:723) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1333) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}
[jira] [Updated] (SPARK-3349) Incorrect partitioning after LIMIT operator
[ https://issues.apache.org/jira/browse/SPARK-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3349: --- Assignee: Eric Liang Incorrect partitioning after LIMIT operator --- Key: SPARK-3349 URL: https://issues.apache.org/jira/browse/SPARK-3349 Project: Spark Issue Type: Bug Components: SQL Reporter: Eric Liang Assignee: Eric Liang Reproduced by the following example:
{code}
import org.apache.spark.sql.catalyst.plans.Inner
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions._
val series = sql("select distinct year from sales order by year asc limit 10")
val results = sql("select * from sales")
series.registerTempTable("series")
results.registerTempTable("results")
sql("select * from results inner join series where results.year = series.year").count
---
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:56) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:189) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:189) at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:298) at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:310) at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:246) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:723) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1333) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
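Until this is fixed, one possible workaround is to rebuild the LIMIT result on top of an explicitly repartitioned RDD so the join no longer sees the stale partitioning metadata. The following is a hedged sketch, not the reported fix: it assumes a SQLContext named sqlContext with the Spark 1.1-era applySchema API, the sales/results tables registered as in the reproduction above, and an arbitrary partition count of 4:
{code}
// Hypothetical workaround for SPARK-3349: force a fresh, honestly-partitioned
// RDD[Row] under the limited result before joining, so the planner inserts a
// real shuffle instead of zipping RDDs with mismatched partition counts.
val limited = sql("select distinct year from sales order by year asc limit 10")
val series = sqlContext.applySchema(limited.repartition(4), limited.schema)
series.registerTempTable("series")
sql("select * from results inner join series where results.year = series.year").count
{code}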
[jira] [Resolved] (SPARK-3361) Expand PEP 8 checks to include EC2 script and Python examples
[ https://issues.apache.org/jira/browse/SPARK-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3361. Resolution: Fixed Fix Version/s: (was: 1.1.0) 1.2.0 Expand PEP 8 checks to include EC2 script and Python examples - Key: SPARK-3361 URL: https://issues.apache.org/jira/browse/SPARK-3361 Project: Spark Issue Type: Sub-task Components: Build Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.2.0 Via {{tox.ini}}, expand the PEP 8 checks to include the EC2 script and all Python examples. That should cover everything. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures
[ https://issues.apache.org/jira/browse/SPARK-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3409. Resolution: Fixed Fix Version/s: 1.2.0 Avoid pulling in Exchange operator itself in Exchange's closures Key: SPARK-3409 URL: https://issues.apache.org/jira/browse/SPARK-3409 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.2.0
{code}
val rdd = child.execute().mapPartitions { iter =>
  if (sortBasedShuffleOn) {
    iter.map(r => (null, r.copy()))
  } else {
    val mutablePair = new MutablePair[Null, Row]()
    iter.map(r => mutablePair.update(null, r))
  }
}
{code}
The above snippet from Exchange references sortBasedShuffleOn within a closure, which requires pulling in the entire Exchange object in the closure. This is a tiny teeny optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
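For readers unfamiliar with the pattern: the usual remedy is to copy the field into a local val before building the closure, so the closure captures only a Boolean instead of the enclosing operator. A sketch of that idea (not necessarily the exact patch):
{code}
// Referencing the field sortBasedShuffleOn inside mapPartitions captures
// `this` (the whole Exchange operator); copying it to a local val first
// means the serialized closure carries only a Boolean.
val sortBased = sortBasedShuffleOn  // evaluated once, outside the closure
val rdd = child.execute().mapPartitions { iter =>
  if (sortBased) {
    iter.map(r => (null, r.copy()))
  } else {
    val mutablePair = new MutablePair[Null, Row]()
    iter.map(r => mutablePair.update(null, r))
  }
}
{code}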
[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124458#comment-14124458 ] Matthew Farrellee commented on SPARK-1701: -- slice vs partition has also come up on stackoverflow and just recently the user list. i'm going to write up a patch for the programming-guide to at least clarify the situation. i intend my pr to partially address this jira. Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3425) OpenJDK - when run with jvm 1.8, should not set MaxPermSize
Matthew Farrellee created SPARK-3425: Summary: OpenJDK - when run with jvm 1.8, should not set MaxPermSize Key: SPARK-3425 URL: https://issues.apache.org/jira/browse/SPARK-3425 Project: Spark Issue Type: Improvement Reporter: Matthew Farrellee Assignee: Adrian Wang Priority: Minor Fix For: 1.2.0 In JVM 1.8.0, MaxPermSize is no longer supported. In spark stderr output, there would be a line of Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3425) OpenJDK - when run with jvm 1.8, should not set MaxPermSize
[ https://issues.apache.org/jira/browse/SPARK-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124467#comment-14124467 ] Matthew Farrellee commented on SPARK-3425: -- this is still an issue for openjdk spark-class: line 111: [: openjdk18: integer expression expected OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0 because the version test is specific to oracle java OpenJDK - when run with jvm 1.8, should not set MaxPermSize --- Key: SPARK-3425 URL: https://issues.apache.org/jira/browse/SPARK-3425 Project: Spark Issue Type: Improvement Reporter: Matthew Farrellee Assignee: Adrian Wang Priority: Minor Fix For: 1.2.0 In JVM 1.8.0, MaxPermSize is no longer supported. In spark stderr output, there would be a line of Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124469#comment-14124469 ] Apache Spark commented on SPARK-1701: - User 'mattf' has created a pull request for this issue: https://github.com/apache/spark/pull/2299 Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124510#comment-14124510 ] Matthew Farrellee commented on SPARK-1701: -- ok, and one more https://github.com/apache/spark/pull/2304 to remove slice terminology from the python examples imho, all 4 of the PRs can be applied to master independently and in any order Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124516#comment-14124516 ] Apache Spark commented on SPARK-1701: - User 'mattf' has created a pull request for this issue: https://github.com/apache/spark/pull/2303 Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124517#comment-14124517 ] Apache Spark commented on SPARK-1701: - User 'mattf' has created a pull request for this issue: https://github.com/apache/spark/pull/2304 Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124515#comment-14124515 ] Apache Spark commented on SPARK-1701: - User 'mattf' has created a pull request for this issue: https://github.com/apache/spark/pull/2302 Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3425) OpenJDK - when run with jvm 1.8, should not set MaxPermSize
[ https://issues.apache.org/jira/browse/SPARK-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124514#comment-14124514 ] Apache Spark commented on SPARK-3425: - User 'mattf' has created a pull request for this issue: https://github.com/apache/spark/pull/2301 OpenJDK - when run with jvm 1.8, should not set MaxPermSize --- Key: SPARK-3425 URL: https://issues.apache.org/jira/browse/SPARK-3425 Project: Spark Issue Type: Improvement Reporter: Matthew Farrellee Assignee: Matthew Farrellee Priority: Minor Fix For: 1.2.0 In JVM 1.8.0, MaxPermSize is no longer supported. In spark stderr output, there would be a line of Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3426) Sort-based shuffle compression behavior is inconsistent
[ https://issues.apache.org/jira/browse/SPARK-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3426: - Description: We have the following configs:
{code}
spark.shuffle.compress
spark.shuffle.spill.compress
{code}
When these two diverge, sort-based shuffle fails with a compression exception under certain workloads. This is because in sort-based shuffle we serve the index file (using spark.shuffle.spill.compress) as a normal shuffle file (using spark.shuffle.compress). It was unfortunate in retrospect that these two configs were exposed so we can't easily remove them. Here is how this can be reproduced. Set the following in your spark-defaults.conf:
{code}
spark.master local-cluster[1,1,512]
spark.shuffle.spill.compress false
spark.shuffle.compress true
spark.shuffle.manager sort
spark.shuffle.memoryFraction 0.001
{code}
Then run the following in spark-shell:
{code}
sc.parallelize(0 until 10).map(i => (i/4, i)).groupByKey().collect()
{code}
was: We have the following configs: (1) spark.shuffle.compress (2) spark.shuffle.spill.compress When these two diverge, sort-based shuffle fails with a compression exception under certain workloads. This is because in sort-based shuffle we serve the index file (using spark.shuffle.spill.compress) as a normal shuffle file (using spark.shuffle.compress). It was unfortunate in retrospect that these two configs were exposed so we can't easily remove them. Here is how this can be reproduced. Set the following in your spark-defaults.conf:
{code}
spark.master local-cluster[1,1,512]
spark.shuffle.spill.compress false
spark.shuffle.compress true
spark.shuffle.manager sort
spark.shuffle.memoryFraction 0.001
{code}
Then run the following in spark-shell:
{code}
sc.parallelize(0 until 10).map(i => (i/4, i)).groupByKey().collect()
{code}
Sort-based shuffle compression behavior is inconsistent --- Key: SPARK-3426 URL: https://issues.apache.org/jira/browse/SPARK-3426 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical We have the following configs:
{code}
spark.shuffle.compress
spark.shuffle.spill.compress
{code}
When these two diverge, sort-based shuffle fails with a compression exception under certain workloads. This is because in sort-based shuffle we serve the index file (using spark.shuffle.spill.compress) as a normal shuffle file (using spark.shuffle.compress). It was unfortunate in retrospect that these two configs were exposed so we can't easily remove them. Here is how this can be reproduced. Set the following in your spark-defaults.conf:
{code}
spark.master local-cluster[1,1,512]
spark.shuffle.spill.compress false
spark.shuffle.compress true
spark.shuffle.manager sort
spark.shuffle.memoryFraction 0.001
{code}
Then run the following in spark-shell:
{code}
sc.parallelize(0 until 10).map(i => (i/4, i)).groupByKey().collect()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3426) Sort-based shuffle compression behavior is inconsistent
[ https://issues.apache.org/jira/browse/SPARK-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3426: - Target Version/s: 1.1.1, 1.2.0 (was: 1.2.0) Sort-based shuffle compression behavior is inconsistent --- Key: SPARK-3426 URL: https://issues.apache.org/jira/browse/SPARK-3426 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical We have the following configs:
{code}
spark.shuffle.compress
spark.shuffle.spill.compress
{code}
When these two diverge, sort-based shuffle fails with a compression exception under certain workloads. This is because in sort-based shuffle we serve the index file (using spark.shuffle.spill.compress) as a normal shuffle file (using spark.shuffle.compress). It was unfortunate in retrospect that these two configs were exposed so we can't easily remove them. Here is how this can be reproduced. Set the following in your spark-defaults.conf:
{code}
spark.master local-cluster[1,1,512]
spark.shuffle.spill.compress false
spark.shuffle.compress true
spark.shuffle.manager sort
spark.shuffle.memoryFraction 0.001
{code}
Then run the following in spark-shell:
{code}
sc.parallelize(0 until 10).map(i => (i/4, i)).groupByKey().collect()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3426) Sort-based shuffle compression behavior is inconsistent
Andrew Or created SPARK-3426: Summary: Sort-based shuffle compression behavior is inconsistent Key: SPARK-3426 URL: https://issues.apache.org/jira/browse/SPARK-3426 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical We have the following configs: (1) spark.shuffle.compress (2) spark.shuffle.spill.compress When these two diverge, sort-based shuffle fails with a compression exception under certain workloads. This is because in sort-based shuffle we serve the index file (using spark.shuffle.spill.compress) as a normal shuffle file (using spark.shuffle.compress). It was unfortunate in retrospect that these two configs were exposed so we can't easily remove them. Here is how this can be reproduced. Set the following in your spark-defaults.conf:
{code}
spark.master local-cluster[1,1,512]
spark.shuffle.spill.compress false
spark.shuffle.compress true
spark.shuffle.manager sort
spark.shuffle.memoryFraction 0.001
{code}
Then run the following in spark-shell:
{code}
sc.parallelize(0 until 10).map(i => (i/4, i)).groupByKey().collect()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
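Until the inconsistency is fixed, the practical guard is to keep the two compression settings identical, so the index file is read with the same codec settings it was served with. A sketch of doing that programmatically (the keys are the ones named above; the values are illustrative):
{code}
import org.apache.spark.SparkConf

// Keep shuffle-file and spill compression in lockstep when using the
// sort-based shuffle manager, so index files and data files agree.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")
  .set("spark.shuffle.compress", "true")
  .set("spark.shuffle.spill.compress", "true") // must match spark.shuffle.compress
{code}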
[jira] [Issue Comment Deleted] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee updated SPARK-1701: - Comment: was deleted (was: ok, i also created 2 other PRs https://github.com/apache/spark/pull/2302 aims to deprecate numSlices and https://github.com/apache/spark/pull/2303 is independent, removing the use of numSlices in pyspark/tests.py) Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee updated SPARK-1701: - Comment: was deleted (was: ok, and one more https://github.com/apache/spark/pull/2304 to remove slice terminology from the python examples imho, all 4 of the PRs can be applied to master independently and in any order) Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124590#comment-14124590 ] Apache Spark commented on SPARK-1701: - User 'mattf' has created a pull request for this issue: https://github.com/apache/spark/pull/2305 Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124645#comment-14124645 ] Apache Spark commented on SPARK-1981: - User 'cfregly' has created a pull request for this issue: https://github.com/apache/spark/pull/2306 Add AWS Kinesis streaming support - Key: SPARK-1981 URL: https://issues.apache.org/jira/browse/SPARK-1981 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Chris Fregly Assignee: Chris Fregly Fix For: 1.1.0 Add AWS Kinesis support to Spark Streaming. Initial discussion occurred here: https://github.com/apache/spark/pull/223 I discussed this with Parviz from AWS recently and we agreed that I would take this over. Look for a new PR that takes into account all the feedback from the earlier PR including spark-1.0-compliant implementation, AWS-license-aware build support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2419) Misc updates to streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124664#comment-14124664 ] Apache Spark commented on SPARK-2419: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/2307 Misc updates to streaming programming guide --- Key: SPARK-2419 URL: https://issues.apache.org/jira/browse/SPARK-2419 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical This JIRA collects together a number of small issues that should be added to the streaming programming guide:
- Receivers consume an executor slot; highlight the fact that # cores > # receivers is necessary
- Classes of spark-streaming-XYZ cannot be accessed from the Spark shell
- Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR
- Ordering and parallelism of the output operations
- Receivers should be serializable
- Add more information on how socketStream works: input stream => iterator function
- New Flume and Kinesis stuff
- Twitter4j version
- Design pattern: creating connections to external sinks (see the sketch after this message)
- Design pattern: multiple input streams
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
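As a preview of the "connections to external sinks" pattern in the list above, the recommendation is usually sketched like this: one connection per partition, created on the executor, never on the driver. ConnectionPool and its methods are hypothetical placeholders, not a Spark API:
{code}
// Sketch: create the connection inside foreachPartition so it is (a) created
// on the executor rather than serialized from the driver, and (b) shared
// across all records of the partition rather than re-created per record.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val connection = ConnectionPool.getConnection() // hypothetical pooled client
    records.foreach(record => connection.send(record.toString))
    ConnectionPool.returnConnection(connection)     // return for reuse
  }
}
{code}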
[jira] [Resolved] (SPARK-3406) Python persist API does not have a default storage level
[ https://issues.apache.org/jira/browse/SPARK-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3406. --- Resolution: Fixed Fix Version/s: 1.2.0 Fixed by Holden in https://github.com/apache/spark/pull/2280 Python persist API does not have a default storage level Key: SPARK-3406 URL: https://issues.apache.org/jira/browse/SPARK-3406 Project: Spark Issue Type: Bug Components: PySpark Reporter: holdenk Priority: Minor Fix For: 1.2.0 Original Estimate: 5m Remaining Estimate: 5m PySpark's persist method on RDD's does not have a default storage level. This is different than the Scala API which defaults to in memory caching. This is minor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
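For reference, the Scala behavior the fix aligns PySpark with: persist() with no argument defaults to StorageLevel.MEMORY_ONLY.
{code}
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext named sc.
val a = sc.parallelize(1 to 100)
a.persist() // no argument: defaults to StorageLevel.MEMORY_ONLY

val b = sc.parallelize(1 to 100)
b.persist(StorageLevel.MEMORY_AND_DISK) // an explicit level, for contrast
{code}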
[jira] [Resolved] (SPARK-3397) Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3397. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2268 [https://github.com/apache/spark/pull/2268] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT -- Key: SPARK-3397 URL: https://issues.apache.org/jira/browse/SPARK-3397 Project: Spark Issue Type: Improvement Components: Build Reporter: Guoqiang Li Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3301) The spark version in the welcome message of pyspark is not correct
[ https://issues.apache.org/jira/browse/SPARK-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3301. --- Resolution: Fixed Fix Version/s: 1.2.0 The spark version in the welcome message of pyspark is not correct -- Key: SPARK-3301 URL: https://issues.apache.org/jira/browse/SPARK-3301 Project: Spark Issue Type: Bug Components: PySpark Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Minor Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3273) We should read the version information from the same place.
[ https://issues.apache.org/jira/browse/SPARK-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3273. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2175 [https://github.com/apache/spark/pull/2175] We should read the version information from the same place. --- Key: SPARK-3273 URL: https://issues.apache.org/jira/browse/SPARK-3273 Project: Spark Issue Type: Bug Components: Spark Shell Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Minor Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3321) Defining a class within python main script
[ https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124704#comment-14124704 ] Matthew Farrellee commented on SPARK-3321: -- this has come up a few times. it's not a problem with spark, but rather an artifact of how python operates. do you have a specific suggestion on how the python interface to spark could work around this python limitation automatically? Defining a class within python main script -- Key: SPARK-3321 URL: https://issues.apache.org/jira/browse/SPARK-3321 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Environment: Python version 2.6.6 Spark version 1.0.1 jdk1.6.0_43 Reporter: Shawn Guo Priority: Critical *leftOuterJoin(self, other, numPartitions=None)* Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k. *Background*: leftOuterJoin will produce None elements in the result dataset. I define a new class 'Null' in the main script to replace every native Python None with a new 'Null' object. The 'Null' object overloads the [] operator.
{code:title=Class Null|borderStyle=solid}
class Null(object):
    def __getitem__(self, key): return None
    def __getstate__(self): pass
    def __setstate__(self, dict): pass

def convert_to_null(x):
    return Null() if x is None else x

X = A.leftOuterJoin(B)
X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
{code}
The code seems to run fine in the pyspark console; however, spark-submit fails with the error messages below: /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] /tmp/python_test.py
{noformat}
File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py", line 77, in main
  serializer.dump_stream(func(split_index, iterator), outfile)
File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 191, in dump_stream
  self.serializer.dump_stream(self._batched(iterator), stream)
File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 124, in dump_stream
  self._write_with_length(obj, stream)
File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 134, in _write_with_length
  serialized = self.dumps(obj)
File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 279, in dumps
  def dumps(self, obj): return cPickle.dumps(obj, 2)
PicklingError: Can't pickle class '__main__.Null': attribute lookup __main__.Null failed
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174) Driver stacktrace: at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at scala.Option.foreach(Option.scala:236) at
[jira] [Commented] (SPARK-3401) Wrong usage of tee command in python/run-tests
[ https://issues.apache.org/jira/browse/SPARK-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124705#comment-14124705 ] Matthew Farrellee commented on SPARK-3401: -- nice catch Wrong usage of tee command in python/run-tests -- Key: SPARK-3401 URL: https://issues.apache.org/jira/browse/SPARK-3401 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Kousuke Saruta Fix For: 1.1.1 In python/run-tests, the tee command is used with the -a option to append to unit-tests.log for logging, but the usage is wrong. In the current implementation, the output of the tee command is redirected to unit-tests.log, as in tee -a > unit-tests.log, even though tee does not need its output redirected. This causes unit-tests.log to be truncated incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3401) Wrong usage of tee command in python/run-tests
[ https://issues.apache.org/jira/browse/SPARK-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee updated SPARK-3401: - Fix Version/s: 1.1.1 Wrong usage of tee command in python/run-tests -- Key: SPARK-3401 URL: https://issues.apache.org/jira/browse/SPARK-3401 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Kousuke Saruta Fix For: 1.1.1 In python/run-tests, the tee command is used with the -a option to append to unit-tests.log for logging, but the usage is wrong. In the current implementation, the output of the tee command is redirected to unit-tests.log, as in tee -a > unit-tests.log, even though tee does not need its output redirected. This causes unit-tests.log to be truncated incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3401) Wrong usage of tee command in python/run-tests
[ https://issues.apache.org/jira/browse/SPARK-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee resolved SPARK-3401. -- Resolution: Fixed Wrong usage of tee command in python/run-tests -- Key: SPARK-3401 URL: https://issues.apache.org/jira/browse/SPARK-3401 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Kousuke Saruta Fix For: 1.1.1 In python/run-tests, the tee command is used with the -a option to append to unit-tests.log for logging, but the usage is wrong. In the current implementation, the output of the tee command is redirected to unit-tests.log, as in tee -a > unit-tests.log, even though tee does not need its output redirected. This causes unit-tests.log to be truncated incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working
[ https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2769: --- Affects Version/s: (was: 1.0.0) 1.1.0 Ganglia Support Broken / Not working Key: SPARK-2769 URL: https://issues.apache.org/jira/browse/SPARK-2769 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Linux Red Hat 6.4 on Spark 1.1.0 Reporter: Stephen Walsh Labels: Ganglia, GraphiteSink, Metrics Hi all, I've built Spark 1.1.0 with sbt with ganglia enabled and hadoop version 2.4.0. No issues there, Spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) is installed. I've added the following to metrics.properties:
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=HOSTNAME
*.sink.graphite.port=8649
*.sink.graphite.period=1
*.sink.graphite.prefix=aa
and I get this error message: 14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) at java.net.SocketOutputStream.write(SocketOutputStream.java:159) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) at java.io.BufferedWriter.flush(BufferedWriter.java:254) at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77) at com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254) at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156) at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107) at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) From looking at the code I see the following:
val graphite: Graphite = new Graphite(new InetSocketAddress(host, port))
val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry)
  .convertDurationsTo(TimeUnit.MILLISECONDS)
  .convertRatesTo(TimeUnit.SECONDS)
  .prefixedWith(prefix)
  .build(graphite)
https://github.com/apache/spark/blob/87bd1f9ef7d547ee54a8a83214b45462e0751efb/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L69 Followed by
override def start() { reporter.start(pollPeriod, pollUnit) }
I noticed that it fails when we first try to send a message, but nowhere do I see graphite.connect() being called? https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62 as it seems to fail on the send function...
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L77 and with this.writer not initialized, the writer.write will fail. The GraphiteBuilder doesn't call it either when creating the reporter object. https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113 Maybe I'm looking in the wrong area and I'm passing in the wrong values - but very little logging has me thinking it is a bug. EDIT: found out where the connect gets called. https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L153 and this is called from here https://github.com/dropwizard/metrics/blob/99dc540c2cbe6bb3be304e20449fb641c7f5382a/metrics-core/src/main/java/com/codahale/metrics/ScheduledReporter.java#L98 which is called from here
[jira] [Created] (SPARK-3427) Standalone version of static PageRank
Ankur Dave created SPARK-3427: - Summary: Standalone version of static PageRank Key: SPARK-3427 URL: https://issues.apache.org/jira/browse/SPARK-3427 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave GraphX's current implementation of static (fixed iteration count) PageRank is implemented using the Pregel API. This unnecessarily tracks active vertices, even though in static PageRank all vertices are always active. Active vertex tracking incurs the following costs: 1. A shuffle per iteration to ship the active sets to the edge partitions. 2. A hash table creation per iteration at each partition to index the active sets for lookup. 3. A hash lookup per edge to check whether the source vertex is active. I reimplemented static PageRank using the GraphX API instead of the higher-level Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3427) Avoid unnecessary active vertex tracking in static PageRank
[ https://issues.apache.org/jira/browse/SPARK-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3427: -- Summary: Avoid unnecessary active vertex tracking in static PageRank (was: Standalone version of static PageRank) Avoid unnecessary active vertex tracking in static PageRank --- Key: SPARK-3427 URL: https://issues.apache.org/jira/browse/SPARK-3427 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave GraphX's current implementation of static (fixed iteration count) PageRank is implemented using the Pregel API. This unnecessarily tracks active vertices, even though in static PageRank all vertices are always active. Active vertex tracking incurs the following costs: 1. A shuffle per iteration to ship the active sets to the edge partitions. 2. A hash table creation per iteration at each partition to index the active sets for lookup. 3. A hash lookup per edge to check whether the source vertex is active. I reimplemented static PageRank using the GraphX API instead of the higher-level Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3427) Avoid active vertex tracking in static PageRank
[ https://issues.apache.org/jira/browse/SPARK-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3427: -- Description: GraphX's current implementation of static (fixed iteration count) PageRank uses the Pregel API. This unnecessarily tracks active vertices, even though in static PageRank all vertices are always active. Active vertex tracking incurs the following costs: 1. A shuffle per iteration to ship the active sets to the edge partitions. 2. A hash table creation per iteration at each partition to index the active sets for lookup. 3. A hash lookup per edge to check whether the source vertex is active. I reimplemented static PageRank using the lower-level GraphX API instead of the Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges. was: GraphX's current implementation of static (fixed iteration count) PageRank is implemented using the Pregel API. This unnecessarily tracks active vertices, even though in static PageRank all vertices are always active. Active vertex tracking incurs the following costs: 1. A shuffle per iteration to ship the active sets to the edge partitions. 2. A hash table creation per iteration at each partition to index the active sets for lookup. 3. A hash lookup per edge to check whether the source vertex is active. I reimplemented static PageRank using the GraphX API instead of the higher-level Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges. Avoid active vertex tracking in static PageRank --- Key: SPARK-3427 URL: https://issues.apache.org/jira/browse/SPARK-3427 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave GraphX's current implementation of static (fixed iteration count) PageRank uses the Pregel API. This unnecessarily tracks active vertices, even though in static PageRank all vertices are always active. Active vertex tracking incurs the following costs: 1. A shuffle per iteration to ship the active sets to the edge partitions. 2. A hash table creation per iteration at each partition to index the active sets for lookup. 3. A hash lookup per edge to check whether the source vertex is active. I reimplemented static PageRank using the lower-level GraphX API instead of the Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3427) Avoid active vertex tracking in static PageRank
[ https://issues.apache.org/jira/browse/SPARK-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3427: -- Summary: Avoid active vertex tracking in static PageRank (was: Avoid unnecessary active vertex tracking in static PageRank) Avoid active vertex tracking in static PageRank --- Key: SPARK-3427 URL: https://issues.apache.org/jira/browse/SPARK-3427 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave GraphX's current implementation of static (fixed iteration count) PageRank is implemented using the Pregel API. This unnecessarily tracks active vertices, even though in static PageRank all vertices are always active. Active vertex tracking incurs the following costs: 1. A shuffle per iteration to ship the active sets to the edge partitions. 2. A hash table creation per iteration at each partition to index the active sets for lookup. 3. A hash lookup per edge to check whether the source vertex is active. I reimplemented static PageRank using the GraphX API instead of the higher-level Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3427) Avoid active vertex tracking in static PageRank
[ https://issues.apache.org/jira/browse/SPARK-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124732#comment-14124732 ] Apache Spark commented on SPARK-3427: - User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/2308 Avoid active vertex tracking in static PageRank --- Key: SPARK-3427 URL: https://issues.apache.org/jira/browse/SPARK-3427 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave GraphX's current implementation of static (fixed iteration count) PageRank uses the Pregel API. This unnecessarily tracks active vertices, even though in static PageRank all vertices are always active. Active vertex tracking incurs the following costs: 1. A shuffle per iteration to ship the active sets to the edge partitions. 2. A hash table creation per iteration at each partition to index the active sets for lookup. 3. A hash lookup per edge to check whether the source vertex is active. I reimplemented static PageRank using the lower-level GraphX API instead of the Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
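To make the reimplementation concrete, here is a minimal sketch of static PageRank on the lower-level GraphX API with no active-set machinery: every vertex sends and receives on every iteration. This is an illustration of the approach, not the actual patch; it assumes the aggregateMessages API introduced around Spark 1.2 (earlier versions expose the equivalent mapReduceTriplets) and omits caching/uncaching of intermediate graphs:
{code}
import org.apache.spark.graphx._
import scala.reflect.ClassTag

// Static PageRank without Pregel's active-vertex tracking: a fixed number of
// iterations, one aggregation per round, no active set to ship, index, or probe.
def staticPageRank[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED], numIter: Int,
    resetProb: Double = 0.15): Graph[Double, Double] = {
  // Edge attribute = 1/outDegree of the source; initial rank = 1.0.
  var ranks: Graph[Double, Double] = graph
    .outerJoinVertices(graph.outDegrees) { (_, _, deg) => deg.getOrElse(0) }
    .mapTriplets(e => 1.0 / e.srcAttr)
    .mapVertices((_, _) => 1.0)

  var i = 0
  while (i < numIter) {
    // Each vertex distributes its rank along out-edges; sums arrive at dst.
    val msgs = ranks.aggregateMessages[Double](
      ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _)
    ranks = ranks.outerJoinVertices(msgs) { (_, _, sum) =>
      resetProb + (1.0 - resetProb) * sum.getOrElse(0.0)
    }
    i += 1
  }
  ranks
}
{code}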
[jira] [Resolved] (SPARK-3353) Stage id monotonicity (parent stage should have lower stage id)
[ https://issues.apache.org/jira/browse/SPARK-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3353. -- Resolution: Fixed Fix Version/s: 1.2.0 Stage id monotonicity (parent stage should have lower stage id) --- Key: SPARK-3353 URL: https://issues.apache.org/jira/browse/SPARK-3353 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.2.0 The way stage IDs are generated is that parent stages actually have higher stage ids. This is very confusing because parent stages get scheduled and executed first. We should reverse that order so the scheduling timeline of stages (absent failures) is monotonic, i.e. stages that are executed first have lower stage ids. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero
[ https://issues.apache.org/jira/browse/SPARK-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124744#comment-14124744 ] Apache Spark commented on SPARK-3178: - User 'bbejeck' has created a pull request for this issue: https://github.com/apache/spark/pull/2309 setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero Key: SPARK-3178 URL: https://issues.apache.org/jira/browse/SPARK-3178 Project: Spark Issue Type: Bug Environment: osx Reporter: Jon Haddad Assignee: Bill Bejeck Labels: starter This should either default to m or just completely fail. Starting a worker with zero memory isn't very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
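A sketch of the defaulting behavior the ticket asks for: treat an unlabeled value as megabytes instead of letting it fall through to zero. This is a hypothetical helper for illustration, not the actual patch (the real scripts parse this setting in startup code):
{code}
// Hypothetical parser illustrating the requested behavior for
// SPARK_WORKER_MEMORY: "2g" -> 2048 MB, "512m" -> 512 MB, and a bare
// "512" is assumed to mean megabytes rather than producing zero.
def parseWorkerMemoryMb(setting: String): Int = {
  val s = setting.trim.toLowerCase
  if (s.endsWith("g")) s.dropRight(1).toInt * 1024
  else if (s.endsWith("m")) s.dropRight(1).toInt
  else s.toInt // no suffix: default to megabytes instead of zero
}
{code}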
[jira] [Created] (SPARK-3428) TaskMetrics for running tasks is missing GC time metrics
Andrew Ash created SPARK-3428: - Summary: TaskMetrics for running tasks is missing GC time metrics Key: SPARK-3428 URL: https://issues.apache.org/jira/browse/SPARK-3428 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Ash Assignee: Sandy Ryza SPARK-2099 added the ability to update helpful metrics like shuffle bytes read/written on the webui via a periodic heartbeat from executors. It omitted GC time metrics though. This ticket is for including GC times in the heartbeats. See the updateAggregateMetrics method here: https://github.com/apache/spark/pull/1056/files#diff-1f32bcb61f51133bd0959a4177a066a5R175 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
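For context, an executor can obtain its accumulated GC time cheaply from JMX, so including it in the heartbeat is mostly plumbing. A sketch of the lookup (sampling this at heartbeat time and diffing against a task-start snapshot gives per-task GC time):
{code}
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// Total GC time (ms) across all garbage collectors in this JVM.
def totalGcTimeMs: Long =
  ManagementFactory.getGarbageCollectorMXBeans.asScala
    .map(_.getCollectionTime)
    .sum
{code}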
[jira] [Commented] (SPARK-2099) Report TaskMetrics for running tasks
[ https://issues.apache.org/jira/browse/SPARK-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124777#comment-14124777 ] Andrew Ash commented on SPARK-2099: --- Added a follow-on ticket for GC times as SPARK-3428 Report TaskMetrics for running tasks Key: SPARK-2099 URL: https://issues.apache.org/jira/browse/SPARK-2099 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical Fix For: 1.1.0 Spark currently collects a set of helpful task metrics, like shuffle bytes written, GC time, and displays them on the app web UI. These are only collected and displayed for tasks that have completed. This makes them unavailable in perhaps the situation where they would be most useful - determining what's going wrong in currently running tasks. Reporting metrics progress for running tasks would probably require adding an executor-driver heartbeat that reports metrics for all tasks currently running on the executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org