[jira] [Updated] (SPARK-3349) Incorrect partitioning after LIMIT operator

2014-09-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3349:
---
Description: 
Reproduced by the following example:
{code}
import org.apache.spark.sql.catalyst.plans.Inner
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions._

val series = sql("select distinct year from sales order by year asc limit 10")
val results = sql("select * from sales")
series.registerTempTable("series")
results.registerTempTable("results")

sql("select * from results inner join series where results.year = series.year").count

---

java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
partitions
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:56)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79)
at 
org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:189)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:189)
at 
org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:298)
at 
org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:310)
at 
org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:246)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:723)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1333)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

  was:
Reproduced by the following example:

import org.apache.spark.sql.catalyst.plans.Inner
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions._

val series = sql("select distinct year from sales order by year asc limit 10")
val results = sql("select * from sales")
series.registerTempTable("series")
results.registerTempTable("results")

sql("select * from results inner join series where results.year = series.year").count

---

java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
partitions
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:56)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at 

[jira] [Updated] (SPARK-3349) Incorrect partitioning after LIMIT operator

2014-09-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3349:
---
Assignee: Eric Liang

 Incorrect partitioning after LIMIT operator
 ---

 Key: SPARK-3349
 URL: https://issues.apache.org/jira/browse/SPARK-3349
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Eric Liang
Assignee: Eric Liang

 Reproduced by the following example:
 {code}
 import org.apache.spark.sql.catalyst.plans.Inner
 import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
 import org.apache.spark.sql.catalyst.expressions._
 val series = sql("select distinct year from sales order by year asc limit 10")
 val results = sql("select * from sales")
 series.registerTempTable("series")
 results.registerTempTable("results")
 sql("select * from results inner join series where results.year = series.year").count
 ---
 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
 partitions
   at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:56)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79)
   at 
 org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:189)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:189)
   at 
 org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:298)
   at 
 org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:310)
   at 
 org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:246)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:723)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1333)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
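
 The exception itself is generic: zipPartitions requires both RDDs to have the 
 same number of partitions, and the issue title suggests the plan's output 
 partitioning is reported incorrectly after LIMIT. A minimal sketch that 
 reproduces the same error with plain RDDs (assumes a running SparkContext 
 sc, e.g. in spark-shell):
 {code}
 // Two RDDs with different partition counts; zipPartitions requires the
 // counts to match, which is what the join plan above trips over.
 val a = sc.parallelize(1 to 10, 4)  // 4 partitions
 val b = sc.parallelize(1 to 10, 2)  // 2 partitions
 a.zipPartitions(b) { (x, y) => x ++ y }.count()
 // java.lang.IllegalArgumentException: Can't zip RDDs with unequal
 // numbers of partitions
 {code}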






[jira] [Resolved] (SPARK-3361) Expand PEP 8 checks to include EC2 script and Python examples

2014-09-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3361.

   Resolution: Fixed
Fix Version/s: (was: 1.1.0)
   1.2.0

 Expand PEP 8 checks to include EC2 script and Python examples
 -

 Key: SPARK-3361
 URL: https://issues.apache.org/jira/browse/SPARK-3361
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
 Fix For: 1.2.0


 Via {{tox.ini}}, expand the PEP 8 checks to include the EC2 script and all 
 Python examples. That should cover everything.






[jira] [Resolved] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures

2014-09-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3409.

   Resolution: Fixed
Fix Version/s: 1.2.0

 Avoid pulling in Exchange operator itself in Exchange's closures
 

 Key: SPARK-3409
 URL: https://issues.apache.org/jira/browse/SPARK-3409
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.2.0


 {code}
 val rdd = child.execute().mapPartitions { iter =>
   if (sortBasedShuffleOn) {
     iter.map(r => (null, r.copy()))
   } else {
     val mutablePair = new MutablePair[Null, Row]()
     iter.map(r => mutablePair.update(null, r))
   }
 }
 {code}
 The above snippet from Exchange references sortBasedShuffleOn within a 
 closure, which requires pulling in the entire Exchange object in the closure. 
 This is a tiny teeny optimization.
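
 A minimal sketch of the usual fix for this kind of accidental capture: copy 
 the flag into a local val before building the closure, so the closure 
 captures only a boolean instead of the enclosing operator (illustrative 
 only, not the actual patch):
 {code}
 // Hoisting the field into a local val means the closure no longer
 // references `this`, so the Exchange instance is not pulled in.
 val sortBased = sortBasedShuffleOn
 val rdd = child.execute().mapPartitions { iter =>
   if (sortBased) {
     iter.map(r => (null, r.copy()))
   } else {
     val mutablePair = new MutablePair[Null, Row]()
     iter.map(r => mutablePair.update(null, r))
   }
 }
 {code}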






[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124458#comment-14124458
 ] 

Matthew Farrellee commented on SPARK-1701:
--

slice vs partition has also come up on stackoverflow and just recently the user 
list.

i'm going to write up a patch for the programming-guide to at least clarify the 
situation.

i intend my pr to partially address this jira.

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.
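
 For concreteness, the mismatch is visible directly in the public Scala API; 
 a small sketch (assumes a SparkContext sc):
 {code}
 // The parameter is named "numSlices"...
 val rdd = sc.parallelize(1 to 100, numSlices = 4)
 // ...but the resulting pieces are exposed as "partitions".
 println(rdd.partitions.length)  // 4
 {code}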






[jira] [Created] (SPARK-3425) OpenJDK - when run with jvm 1.8, should not set MaxPermSize

2014-09-06 Thread Matthew Farrellee (JIRA)
Matthew Farrellee created SPARK-3425:


 Summary: OpenJDK - when run with jvm 1.8, should not set 
MaxPermSize
 Key: SPARK-3425
 URL: https://issues.apache.org/jira/browse/SPARK-3425
 Project: Spark
  Issue Type: Improvement
Reporter: Matthew Farrellee
Assignee: Adrian Wang
Priority: Minor
 Fix For: 1.2.0


In JVM 1.8.0, MaxPermSize is no longer supported.
In spark stderr output, there would be a line of

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; 
support was removed in 8.0






[jira] [Commented] (SPARK-3425) OpenJDK - when run with jvm 1.8, should not set MaxPermSize

2014-09-06 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124467#comment-14124467
 ] 

Matthew Farrellee commented on SPARK-3425:
--

this is still an issue for openjdk

  spark-class: line 111: [: openjdk18: integer expression expected
  OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support 
was removed in 8.0

because the version test is specific to oracle java

 OpenJDK - when run with jvm 1.8, should not set MaxPermSize
 ---

 Key: SPARK-3425
 URL: https://issues.apache.org/jira/browse/SPARK-3425
 Project: Spark
  Issue Type: Improvement
Reporter: Matthew Farrellee
Assignee: Adrian Wang
Priority: Minor
 Fix For: 1.2.0


 In JVM 1.8.0, MaxPermSize is no longer supported.
 In spark stderr output, there would be a line of
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; 
 support was removed in 8.0






[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124469#comment-14124469
 ] 

Apache Spark commented on SPARK-1701:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2299

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.






[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124510#comment-14124510
 ] 

Matthew Farrellee commented on SPARK-1701:
--

ok, and one more

https://github.com/apache/spark/pull/2304 to remove slice terminology from the 
python examples

imho, all 4 of the PRs can be applied to master independently and in any order

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.






[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124516#comment-14124516
 ] 

Apache Spark commented on SPARK-1701:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2303

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.






[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124517#comment-14124517
 ] 

Apache Spark commented on SPARK-1701:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2304

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.






[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124515#comment-14124515
 ] 

Apache Spark commented on SPARK-1701:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2302

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.






[jira] [Commented] (SPARK-3425) OpenJDK - when run with jvm 1.8, should not set MaxPermSize

2014-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124514#comment-14124514
 ] 

Apache Spark commented on SPARK-3425:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2301

 OpenJDK - when run with jvm 1.8, should not set MaxPermSize
 ---

 Key: SPARK-3425
 URL: https://issues.apache.org/jira/browse/SPARK-3425
 Project: Spark
  Issue Type: Improvement
Reporter: Matthew Farrellee
Assignee: Matthew Farrellee
Priority: Minor
 Fix For: 1.2.0


 In JVM 1.8.0, MaxPermSize is no longer supported.
 In spark stderr output, there would be a line of
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; 
 support was removed in 8.0






[jira] [Updated] (SPARK-3426) Sort-based shuffle compression behavior is inconsistent

2014-09-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3426:
-
Description: 
We have the following configs:

{code}
spark.shuffle.compress
spark.shuffle.spill.compress
{code}

When these two diverge, sort-based shuffle fails with a compression exception 
under certain workloads. This is because in sort-based shuffle we serve the 
index file (using spark.shuffle.spill.compress) as a normal shuffle file (using 
spark.shuffle.compress). It was unfortunate in retrospect that these two 
configs were exposed so we can't easily remove them.

Here is how this can be reproduced. Set the following in your 
spark-defaults.conf:
{code}
spark.master  local-cluster[1,1,512]
spark.shuffle.spill.compress  false
spark.shuffle.compress        true
spark.shuffle.manager sort
spark.shuffle.memoryFraction  0.001
{code}
Then run the following in spark-shell:
{code}
sc.parallelize(0 until 10).map(i => (i/4, i)).groupByKey().collect()
{code}

  was:
We have the following configs:

(1) spark.shuffle.compress
(2) spark.shuffle.spill.compress

When these two diverge, sort-based shuffle fails with a compression exception 
under certain workloads. This is because in sort-based shuffle we serve the 
index file (using spark.shuffle.spill.compress) as a normal shuffle file (using 
spark.shuffle.compress). It was unfortunate in retrospect that these two 
configs were exposed so we can't easily remove them.

Here is how this can be reproduced. Set the following in your 
spark-defaults.conf:
{code}
spark.master  local-cluster[1,1,512]
spark.shuffle.spill.compress  false
spark.shuffle.compress        true
spark.shuffle.manager sort
spark.shuffle.memoryFraction  0.001
{code}
Then run the following in spark-shell:
{code}
sc.parallelize(0 until 10).map(i => (i/4, i)).groupByKey().collect()
{code}


 Sort-based shuffle compression behavior is inconsistent
 ---

 Key: SPARK-3426
 URL: https://issues.apache.org/jira/browse/SPARK-3426
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

 We have the following configs:
 {code}
 spark.shuffle.compress
 spark.shuffle.spill.compress
 {code}
 When these two diverge, sort-based shuffle fails with a compression exception 
 under certain workloads. This is because in sort-based shuffle we serve the 
 index file (using spark.shuffle.spill.compress) as a normal shuffle file 
 (using spark.shuffle.compress). It was unfortunate in retrospect that these 
 two configs were exposed so we can't easily remove them.
 Here is how this can be reproduced. Set the following in your 
 spark-defaults.conf:
 {code}
 spark.master  local-cluster[1,1,512]
 spark.shuffle.spill.compress  false
 spark.shuffle.compress        true
 spark.shuffle.manager sort
 spark.shuffle.memoryFraction  0.001
 {code}
 Then run the following in spark-shell:
 {code}
 sc.parallelize(0 until 10).map(i => (i/4, i)).groupByKey().collect()
 {code}
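
 A sketch of the class of failure, using snappy-java (the codec Spark 1.1 
 uses by default) purely for illustration: compress on write but read the 
 bytes back without the codec, the same writer/reader disagreement the index 
 file hits when the two configs diverge.
 {code}
 import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
 import org.xerial.snappy.{SnappyInputStream, SnappyOutputStream}

 val buf = new ByteArrayOutputStream()
 val out = new SnappyOutputStream(buf)          // writer compresses
 out.write("shuffle bytes".getBytes("UTF-8"))
 out.close()

 // Reading with the matching codec round-trips fine...
 val ok = new SnappyInputStream(new ByteArrayInputStream(buf.toByteArray))
 // ...while treating the same bytes as uncompressed data (what a
 // mismatched reader does) yields garbage or an exception.
 val raw = new ByteArrayInputStream(buf.toByteArray)
 {code}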






[jira] [Updated] (SPARK-3426) Sort-based shuffle compression behavior is inconsistent

2014-09-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3426:
-
Target Version/s: 1.1.1, 1.2.0  (was: 1.2.0)

 Sort-based shuffle compression behavior is inconsistent
 ---

 Key: SPARK-3426
 URL: https://issues.apache.org/jira/browse/SPARK-3426
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

 We have the following configs:
 {code}
 spark.shuffle.compress
 spark.shuffle.spill.compress
 {code}
 When these two diverge, sort-based shuffle fails with a compression exception 
 under certain workloads. This is because in sort-based shuffle we serve the 
 index file (using spark.shuffle.spill.compress) as a normal shuffle file 
 (using spark.shuffle.compress). It was unfortunate in retrospect that these 
 two configs were exposed so we can't easily remove them.
 Here is how this can be reproduced. Set the following in your 
 spark-defaults.conf:
 {code}
 spark.master  local-cluster[1,1,512]
 spark.shuffle.spill.compress  false
 spark.shuffle.compress        true
 spark.shuffle.manager sort
 spark.shuffle.memoryFraction  0.001
 {code}
 Then run the following in spark-shell:
 {code}
 sc.parallelize(0 until 10).map(i => (i/4, i)).groupByKey().collect()
 {code}






[jira] [Created] (SPARK-3426) Sort-based shuffle compression behavior is inconsistent

2014-09-06 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3426:


 Summary: Sort-based shuffle compression behavior is inconsistent
 Key: SPARK-3426
 URL: https://issues.apache.org/jira/browse/SPARK-3426
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical


We have the following configs:

(1) spark.shuffle.compress
(2) spark.shuffle.spill.compress

When these two diverge, sort-based shuffle fails with a compression exception 
under certain workloads. This is because in sort-based shuffle we serve the 
index file (using spark.shuffle.spill.compress) as a normal shuffle file (using 
spark.shuffle.compress). It was unfortunate in retrospect that these two 
configs were exposed so we can't easily remove them.

Here is how this can be reproduced. Set the following in your 
spark-defaults.conf:
{code}
spark.master  local-cluster[1,1,512]
spark.shuffle.spill.compress  false
spark.shuffle.compress        true
spark.shuffle.manager sort
spark.shuffle.memoryFraction  0.001
{code}
Then run the following in spark-shell:
{code}
sc.parallelize(0 until 10).map(i => (i/4, i)).groupByKey().collect()
{code}






[jira] [Issue Comment Deleted] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-1701:
-
Comment: was deleted

(was: ok, i also created 2 other PRs

https://github.com/apache/spark/pull/2302 aims to deprecate numSlices

and

https://github.com/apache/spark/pull/2303 is independent, removing the use of 
numSlices in pyspark/tests.py)

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.






[jira] [Issue Comment Deleted] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-1701:
-
Comment: was deleted

(was: ok, and one more

https://github.com/apache/spark/pull/2304 to remove slice terminology from the 
python examples

imho, all 4 of the PRs can be applied to master independently and in any order)

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.






[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124590#comment-14124590
 ] 

Apache Spark commented on SPARK-1701:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2305

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.






[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124645#comment-14124645
 ] 

Apache Spark commented on SPARK-1981:
-

User 'cfregly' has created a pull request for this issue:
https://github.com/apache/spark/pull/2306

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly
 Fix For: 1.1.0


 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here: https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.






[jira] [Commented] (SPARK-2419) Misc updates to streaming programming guide

2014-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124664#comment-14124664
 ] 

Apache Spark commented on SPARK-2419:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/2307

 Misc updates to streaming programming guide
 ---

 Key: SPARK-2419
 URL: https://issues.apache.org/jira/browse/SPARK-2419
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical

 This JIRA collects together a number of small issues that should be added to 
 the streaming programming guide:
 - Receivers consume an executor slot; highlight the fact that # cores > # 
 receivers is necessary
 - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell
 - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar 
 and its dependencies to be packaged with the application JAR
 - Ordering and parallelism of the output operations
 - Receivers should be serializable
 - Add more information on how socketStream works: input stream => iterator 
 function.
 - New Flume and Kinesis stuff.
 - Twitter4j version
 - Design pattern: creating connections to external sinks (see the sketch 
 below)
 - Design pattern: multiple input streams
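
 For the external-sinks item, a minimal sketch of the per-partition 
 connection pattern (createConnection and send are hypothetical stand-ins 
 for a real client library, and dstream for an existing DStream):
 {code}
 dstream.foreachRDD { rdd =>
   rdd.foreachPartition { records =>
     // One connection per partition, created on the executor -- not one
     // per record, and not on the driver (it would not serialize).
     val conn = createConnection()          // hypothetical client
     records.foreach(r => conn.send(r))     // hypothetical send
     conn.close()
   }
 }
 {code}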






[jira] [Resolved] (SPARK-3406) Python persist API does not have a default storage level

2014-09-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3406.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Fixed by Holden in https://github.com/apache/spark/pull/2280

 Python persist API does not have a default storage level
 

 Key: SPARK-3406
 URL: https://issues.apache.org/jira/browse/SPARK-3406
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: holdenk
Priority: Minor
 Fix For: 1.2.0

   Original Estimate: 5m
  Remaining Estimate: 5m

 PySpark's persist method on RDDs does not have a default storage level. This 
 is different from the Scala API, which defaults to in-memory caching. This is 
 minor.
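
 For reference, a one-line sketch of the Scala behavior being compared 
 against (assumes an existing RDD rdd):
 {code}
 import org.apache.spark.storage.StorageLevel
 rdd.persist()                            // defaults to in-memory caching,
 rdd.persist(StorageLevel.MEMORY_ONLY)    // i.e. the same as this call
 {code}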






[jira] [Resolved] (SPARK-3397) Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT

2014-09-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3397.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2268
[https://github.com/apache/spark/pull/2268]

 Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
 --

 Key: SPARK-3397
 URL: https://issues.apache.org/jira/browse/SPARK-3397
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Guoqiang Li
 Fix For: 1.2.0









[jira] [Resolved] (SPARK-3301) The spark version in the welcome message of pyspark is not correct

2014-09-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3301.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

 The spark version in the welcome message of pyspark is not correct
 --

 Key: SPARK-3301
 URL: https://issues.apache.org/jira/browse/SPARK-3301
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Guoqiang Li
Assignee: Guoqiang Li
Priority: Minor
 Fix For: 1.2.0









[jira] [Resolved] (SPARK-3273) We should read the version information from the same place.

2014-09-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3273.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2175
[https://github.com/apache/spark/pull/2175]

 We should read the version information from the same place.
 ---

 Key: SPARK-3273
 URL: https://issues.apache.org/jira/browse/SPARK-3273
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Reporter: Guoqiang Li
Assignee: Guoqiang Li
Priority: Minor
 Fix For: 1.2.0









[jira] [Commented] (SPARK-3321) Defining a class within python main script

2014-09-06 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124704#comment-14124704
 ] 

Matthew Farrellee commented on SPARK-3321:
--

this has come up a few times. it's not a problem with spark, but rather an 
artifact of how python operates.

do you have a specific suggestion on how the python interface to spark could 
work around this python limitation automatically?

 Defining a class within python main script
 --

 Key: SPARK-3321
 URL: https://issues.apache.org/jira/browse/SPARK-3321
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.1
 Environment: Python version 2.6.6
 Spark version version 1.0.1
 jdk1.6.0_43
Reporter: Shawn Guo
Priority: Critical

 *leftOuterJoin(self, other, numPartitions=None)*
 Perform a left outer join of self and other.
 For each element (k, v) in self, the resulting RDD will either contain all 
 pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements 
 in other have key k.
 *Background*: leftOuterJoin will produce None elements in the result dataset.
 I define a new class 'Null' in the main script to replace all Python-native 
 None values with new 'Null' objects. The 'Null' object overloads the [] 
 operator.
 {code:title=Class Null|borderStyle=solid}
 class Null(object):
     def __getitem__(self, key): return None
     def __getstate__(self): pass
     def __setstate__(self, dict): pass

 def convert_to_null(x):
     return Null() if x is None else x

 X = A.leftOuterJoin(B)
 X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
 {code}
 The code seems to run fine in the pyspark console; however, spark-submit 
 fails with the error messages below:
 /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] 
 /tmp/python_test.py
 {noformat}
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 
 77, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 191, in dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 124, in dump_stream
 self._write_with_length(obj, stream)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 134, in _write_with_length
 serialized = self.dumps(obj)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 279, in dumps
 def dumps(self, obj): return cPickle.dumps(obj, 2)
 PicklingError: Can't pickle <class '__main__.Null'>: attribute lookup 
 __main__.Null failed
 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
 
 org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33)
 org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
 Driver stacktrace:
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at scala.Option.foreach(Option.scala:236)
 at 
 

[jira] [Commented] (SPARK-3401) Wrong usage of tee command in python/run-tests

2014-09-06 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124705#comment-14124705
 ] 

Matthew Farrellee commented on SPARK-3401:
--

nice catch

 Wrong usage of tee command in python/run-tests
 --

 Key: SPARK-3401
 URL: https://issues.apache.org/jira/browse/SPARK-3401
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
 Fix For: 1.1.1


 In python/run-tests, the tee command is used with the -a option to append to 
 unit-tests.log for logging, but the usage is wrong.
 In the current implementation, the output of the tee command is redirected 
 to unit-tests.log, as in "tee -a > unit-tests.log".
 The redirection of tee's output is not needed, and this wrong usage causes 
 unit-tests.log to be truncated rather than appended to.






[jira] [Updated] (SPARK-3401) Wrong usage of tee command in python/run-tests

2014-09-06 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-3401:
-
Fix Version/s: 1.1.1

 Wrong usage of tee command in python/run-tests
 --

 Key: SPARK-3401
 URL: https://issues.apache.org/jira/browse/SPARK-3401
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
 Fix For: 1.1.1


 In python/run-tests, the tee command is used with the -a option to append to 
 unit-tests.log for logging, but the usage is wrong.
 In the current implementation, the output of the tee command is redirected 
 to unit-tests.log, as in "tee -a > unit-tests.log".
 The redirection of tee's output is not needed, and this wrong usage causes 
 unit-tests.log to be truncated rather than appended to.






[jira] [Resolved] (SPARK-3401) Wrong usage of tee command in python/run-tests

2014-09-06 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee resolved SPARK-3401.
--
Resolution: Fixed

 Wrong usage of tee command in python/run-tests
 --

 Key: SPARK-3401
 URL: https://issues.apache.org/jira/browse/SPARK-3401
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
 Fix For: 1.1.1


 In python/run-tests, the tee command is used with the -a option to append to 
 unit-tests.log for logging, but the usage is wrong.
 In the current implementation, the output of the tee command is redirected 
 to unit-tests.log, as in "tee -a > unit-tests.log".
 The redirection of tee's output is not needed, and this wrong usage causes 
 unit-tests.log to be truncated rather than appended to.






[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working

2014-09-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2769:
---
Affects Version/s: (was: 1.0.0)
   1.1.0

 Ganglia Support Broken / Not working
 

 Key: SPARK-2769
 URL: https://issues.apache.org/jira/browse/SPARK-2769
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: Linux Red Hat 6.4 on Spark 1.1.0
Reporter: Stephen Walsh
  Labels: Ganglia, GraphiteSink,, Metrics

 Hi all,
 I've built spark 1.1.0 with sbt with ganglia enabled and hadoop version 2.4.0
 No issues there, spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) 
 is installed.
 I've added the following to the metrics.properties
 *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
 *.sink.graphite.host=HOSTNAME
 *.sink.graphite.port=8649
 *.sink.graphite.period=1
 *.sink.graphite.prefix=aa
 and I get this error message
 14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite
 java.net.SocketException: Broken pipe
 at java.net.SocketOutputStream.socketWrite0(Native Method)
 at 
 java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
 at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
 at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
 at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
 at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
 at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
 at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
 at java.io.BufferedWriter.flush(BufferedWriter.java:254)
 at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77)
 at 
 com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254)
 at 
 com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156)
 at 
 com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107)
 at 
 com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 From looking at the code I see the following.
   val graphite: Graphite = new Graphite(new InetSocketAddress(host, port))
   val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry)
     .convertDurationsTo(TimeUnit.MILLISECONDS)
     .convertRatesTo(TimeUnit.SECONDS)
     .prefixedWith(prefix)
     .build(graphite)
 https://github.com/apache/spark/blob/87bd1f9ef7d547ee54a8a83214b45462e0751efb/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L69
 Followed by
 override def start() {
 reporter.start(pollPeriod, pollUnit)
   }
 I noticed that it fails when we first try to send a message, but 
 nowhere do I see graphite.connect() being called?
 https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62
 as it seems to fail on the send function.
 https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L77
 and with this.writer not initialized, the writer.write call will fail.
 The GraphiteBuilder doesn't call it either when creating the reporter 
 object.
 https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113
 Maybe I'm looking in the wrong area and I'm passing in the wrong values - but 
 very little logging has me thinking it is a bug.
 EDIT:
 found out where the connect gets called.
 https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L153
 and this is called from here
 https://github.com/dropwizard/metrics/blob/99dc540c2cbe6bb3be304e20449fb641c7f5382a/metrics-core/src/main/java/com/codahale/metrics/ScheduledReporter.java#L98
 which is called from here
 

[jira] [Created] (SPARK-3427) Standalone version of static PageRank

2014-09-06 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-3427:
-

 Summary: Standalone version of static PageRank
 Key: SPARK-3427
 URL: https://issues.apache.org/jira/browse/SPARK-3427
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave


GraphX's current implementation of static (fixed iteration count) PageRank is 
implemented using the Pregel API. This unnecessarily tracks active vertices, 
even though in static PageRank all vertices are always active. Active vertex 
tracking incurs the following costs:

1. A shuffle per iteration to ship the active sets to the edge partitions.
2. A hash table creation per iteration at each partition to index the active 
sets for lookup.
3. A hash lookup per edge to check whether the source vertex is active.

I reimplemented static PageRank using the GraphX API instead of the 
higher-level Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this 
provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 
iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges.
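
A self-contained sketch of fixed-iteration PageRank on plain RDDs
(illustrative only, not the GraphX implementation described above; assumes a
SparkContext sc): every vertex participates in every iteration, which is why
no active-set machinery is needed.
{code}
val edges = sc.parallelize(Seq(1L -> 2L, 2L -> 3L, 3L -> 1L, 1L -> 3L))
val links = edges.groupByKey().cache()      // adjacency lists
var ranks = links.mapValues(_ => 1.0)       // initial rank for every vertex
for (_ <- 1 to 10) {                        // fixed iteration count
  val contribs = links.join(ranks).values.flatMap { case (out, rank) =>
    out.map(dst => (dst, rank / out.size))  // all vertices always contribute
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.collect().foreach(println)
{code}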






[jira] [Updated] (SPARK-3427) Avoid unnecessary active vertex tracking in static PageRank

2014-09-06 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-3427:
--
Summary: Avoid unnecessary active vertex tracking in static PageRank  (was: 
Standalone version of static PageRank)

 Avoid unnecessary active vertex tracking in static PageRank
 ---

 Key: SPARK-3427
 URL: https://issues.apache.org/jira/browse/SPARK-3427
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

 GraphX's current implementation of static (fixed iteration count) PageRank is 
 implemented using the Pregel API. This unnecessarily tracks active vertices, 
 even though in static PageRank all vertices are always active. Active vertex 
 tracking incurs the following costs:
 1. A shuffle per iteration to ship the active sets to the edge partitions.
 2. A hash table creation per iteration at each partition to index the active 
 sets for lookup.
 3. A hash lookup per edge to check whether the source vertex is active.
 I reimplemented static PageRank using the GraphX API instead of the 
 higher-level Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this 
 provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 
 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges.






[jira] [Updated] (SPARK-3427) Avoid active vertex tracking in static PageRank

2014-09-06 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-3427:
--
Description: 
GraphX's current implementation of static (fixed iteration count) PageRank uses 
the Pregel API. This unnecessarily tracks active vertices, even though in 
static PageRank all vertices are always active. Active vertex tracking incurs 
the following costs:

1. A shuffle per iteration to ship the active sets to the edge partitions.
2. A hash table creation per iteration at each partition to index the active 
sets for lookup.
3. A hash lookup per edge to check whether the source vertex is active.

I reimplemented static PageRank using the lower-level GraphX API instead of the 
Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% 
speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank 
on a synthetic graph with 10M vertices and 1.27B edges.

  was:
GraphX's current implementation of static (fixed iteration count) PageRank is 
implemented using the Pregel API. This unnecessarily tracks active vertices, 
even though in static PageRank all vertices are always active. Active vertex 
tracking incurs the following costs:

1. A shuffle per iteration to ship the active sets to the edge partitions.
2. A hash table creation per iteration at each partition to index the active 
sets for lookup.
3. A hash lookup per edge to check whether the source vertex is active.

I reimplemented static PageRank using the GraphX API instead of the 
higher-level Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this 
provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 
iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges.


 Avoid active vertex tracking in static PageRank
 ---

 Key: SPARK-3427
 URL: https://issues.apache.org/jira/browse/SPARK-3427
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

 GraphX's current implementation of static (fixed iteration count) PageRank 
 uses the Pregel API. This unnecessarily tracks active vertices, even though 
 in static PageRank all vertices are always active. Active vertex tracking 
 incurs the following costs:
 1. A shuffle per iteration to ship the active sets to the edge partitions.
 2. A hash table creation per iteration at each partition to index the active 
 sets for lookup.
 3. A hash lookup per edge to check whether the source vertex is active.
 I reimplemented static PageRank using the lower-level GraphX API instead of 
 the Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided 
 a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of 
 PageRank on a synthetic graph with 10M vertices and 1.27B edges.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3427) Avoid active vertex tracking in static PageRank

2014-09-06 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-3427:
--
Summary: Avoid active vertex tracking in static PageRank  (was: Avoid 
unnecessary active vertex tracking in static PageRank)

 Avoid active vertex tracking in static PageRank
 ---

 Key: SPARK-3427
 URL: https://issues.apache.org/jira/browse/SPARK-3427
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

 GraphX's current implementation of static (fixed iteration count) PageRank is 
 implemented using the Pregel API. This unnecessarily tracks active vertices, 
 even though in static PageRank all vertices are always active. Active vertex 
 tracking incurs the following costs:
 1. A shuffle per iteration to ship the active sets to the edge partitions.
 2. A hash table creation per iteration at each partition to index the active 
 sets for lookup.
 3. A hash lookup per edge to check whether the source vertex is active.
 I reimplemented static PageRank using the GraphX API instead of the 
 higher-level Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this 
 provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 
 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3427) Avoid active vertex tracking in static PageRank

2014-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124732#comment-14124732
 ] 

Apache Spark commented on SPARK-3427:
-

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/2308

 Avoid active vertex tracking in static PageRank
 ---

 Key: SPARK-3427
 URL: https://issues.apache.org/jira/browse/SPARK-3427
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

 GraphX's current implementation of static (fixed iteration count) PageRank 
 uses the Pregel API. This unnecessarily tracks active vertices, even though 
 in static PageRank all vertices are always active. Active vertex tracking 
 incurs the following costs:
 1. A shuffle per iteration to ship the active sets to the edge partitions.
 2. A hash table creation per iteration at each partition to index the active 
 sets for lookup.
 3. A hash lookup per edge to check whether the source vertex is active.
 I reimplemented static PageRank using the lower-level GraphX API instead of 
 the Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided 
 a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of 
 PageRank on a synthetic graph with 10M vertices and 1.27B edges.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3353) Stage id monotonicity (parent stage should have lower stage id)

2014-09-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3353.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 Stage id monotonicity (parent stage should have lower stage id)
 ---

 Key: SPARK-3353
 URL: https://issues.apache.org/jira/browse/SPARK-3353
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.2.0


 Stage IDs are currently generated so that parent stages actually have higher 
 stage IDs. This is very confusing because parent stages get scheduled and 
 executed first.
 We should reverse that order so that the scheduling timeline of stages 
 (absent failures) is monotonic, i.e. stages that are executed first have 
 lower stage IDs.
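
 For example, the intended invariant could be checked with a listener along 
 these lines (a hypothetical sketch, not part of the fix itself):

{code}
import org.apache.spark.scheduler._

// Hypothetical check: once stage ids are monotonic, a stage submitted
// later should never carry a lower id than one submitted earlier
// (absent failures, which can resubmit old stages).
class StageIdMonotonicityCheck extends SparkListener {
  private var lastSubmittedId = -1
  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
    val id = event.stageInfo.stageId
    if (id < lastSubmittedId) {
      println(s"Non-monotonic stage id: $id submitted after $lastSubmittedId")
    }
    lastSubmittedId = math.max(lastSubmittedId, id)
  }
}
{code}

 Register it with SparkContext.addSparkListener to observe the submission 
 order.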



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero

2014-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124744#comment-14124744
 ] 

Apache Spark commented on SPARK-3178:
-

User 'bbejeck' has created a pull request for this issue:
https://github.com/apache/spark/pull/2309

 setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the 
 worker memory limit to zero
 

 Key: SPARK-3178
 URL: https://issues.apache.org/jira/browse/SPARK-3178
 Project: Spark
  Issue Type: Bug
 Environment: osx
Reporter: Jon Haddad
Assignee: Bill Bejeck
  Labels: starter

 This should either default to m or fail outright. Starting a worker with 
 zero memory isn't very helpful.
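
 A hedged sketch of the desired validation (illustrative only; not Spark's 
 actual memory-string parsing):

{code}
// Parse SPARK_WORKER_MEMORY into megabytes, rejecting bare numbers
// instead of silently treating them as ~zero. Hypothetical helper.
def parseWorkerMemoryMb(s: String): Int = {
  val v = s.trim.toLowerCase
  if (v.endsWith("g")) v.dropRight(1).toInt * 1024
  else if (v.endsWith("m")) v.dropRight(1).toInt
  else if (v.nonEmpty && v.forall(_.isDigit))
    sys.error(s"SPARK_WORKER_MEMORY '$s' has no unit suffix (m or g)")
  else
    sys.error(s"Cannot parse memory string '$s'")
}
{code}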



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3428) TaskMetrics for running tasks is missing GC time metrics

2014-09-06 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-3428:
-

 Summary: TaskMetrics for running tasks is missing GC time metrics
 Key: SPARK-3428
 URL: https://issues.apache.org/jira/browse/SPARK-3428
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Ash
Assignee: Sandy Ryza


SPARK-2099 added the ability to update helpful metrics like shuffle bytes 
read/written on the web UI via a periodic heartbeat from executors. It 
omitted GC time metrics, though.

This ticket is for including GC times in the heartbeats.

See the updateAggregateMetrics method here:
https://github.com/apache/spark/pull/1056/files#diff-1f32bcb61f51133bd0959a4177a066a5R175
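
For reference, cumulative GC time is available from JMX on the executor; a 
minimal sketch of sampling it for a heartbeat (hypothetical, not the eventual 
patch):

{code}
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// Sum accumulated collection time (ms) across all garbage collectors.
// getCollectionTime returns -1 when a collector does not report time.
def totalGcTimeMillis(): Long =
  ManagementFactory.getGarbageCollectorMXBeans.asScala
    .map(_.getCollectionTime)
    .filter(_ >= 0)
    .sum
{code}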



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2099) Report TaskMetrics for running tasks

2014-09-06 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124777#comment-14124777
 ] 

Andrew Ash commented on SPARK-2099:
---

Added a follow-on ticket for GC times as SPARK-3428

 Report TaskMetrics for running tasks
 

 Key: SPARK-2099
 URL: https://issues.apache.org/jira/browse/SPARK-2099
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
 Fix For: 1.1.0


 Spark currently collects a set of helpful task metrics, like shuffle bytes 
 written and GC time, and displays them on the app web UI.  These are only 
 collected and displayed for tasks that have completed.  This makes them 
 unsuitable for perhaps the situation where they would be most useful: 
 determining what's going wrong in currently running tasks.
 Reporting metrics progress for running tasks would probably require adding an 
 executor-driver heartbeat that reports metrics for all tasks currently 
 running on the executor.
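
 A minimal sketch of such a heartbeat loop (names invented for illustration; 
 not Spark's actual implementation):

{code}
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical executor-side heartbeat: periodically collect per-task
// metric snapshots and send them to the driver.
case class TaskMetricsSnapshot(taskId: Long, shuffleBytesWritten: Long,
                               gcTimeMillis: Long)

class MetricsHeartbeat(collect: () => Seq[TaskMetricsSnapshot],
                       send: Seq[TaskMetricsSnapshot] => Unit,
                       intervalMillis: Long) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  def start(): Unit =
    scheduler.scheduleAtFixedRate(
      new Runnable { def run(): Unit = send(collect()) },
      intervalMillis, intervalMillis, TimeUnit.MILLISECONDS)
  def stop(): Unit = scheduler.shutdown()
}
{code}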



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org