[jira] [Created] (SPARK-2162) possible to read from removed block in blockmanager
Raymond Liu created SPARK-2162: -- Summary: possible to read from removed block in blockmanager Key: SPARK-2162 URL: https://issues.apache.org/jira/browse/SPARK-2162 Project: Spark Issue Type: Bug Components: Block Manager Reporter: Raymond Liu Priority: Minor In BlockManager's doGetLocal method, there is a chance that the block info is removed before the info.synchronized block is entered. Thus it will either read in vain in the memory-level case, or throw an exception in the disk-level case, because it believes the block is there while it has actually been removed. -- This message was sent by Atlassian JIRA (v6.2#6252)
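[Editorial note] The race described above is the classic check-then-act pattern: the caller checks that a block exists, but another thread removes it before the lock is taken. A minimal Python sketch of the defensive re-check that avoids it (hypothetical names; Spark's actual BlockManager is Scala and more involved):

```python
import threading

class BlockStore:
    """Toy stand-in for BlockManager's block-info map (illustrative only)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._blocks = {}

    def put(self, block_id, data):
        with self._lock:
            self._blocks[block_id] = data

    def remove(self, block_id):
        with self._lock:
            self._blocks.pop(block_id, None)

    def get_local(self, block_id):
        # Re-check under the lock: between a caller's earlier "is it there?"
        # check and this call, another thread may have removed the block.
        with self._lock:
            return self._blocks.get(block_id)  # None if removed, never raises

store = BlockStore()
store.put("rdd_0_0", b"bytes")
store.remove("rdd_0_0")
print(store.get_local("rdd_0_0"))  # None, not an exception
```

The key point is that absence is reported as a normal value under the lock, rather than assumed impossible after an earlier unsynchronized check.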
[jira] [Resolved] (SPARK-2130) Clarify PySpark docs for RDD.getStorageLevel
[ https://issues.apache.org/jira/browse/SPARK-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2130. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1096 [https://github.com/apache/spark/pull/1096] Clarify PySpark docs for RDD.getStorageLevel Key: SPARK-2130 URL: https://issues.apache.org/jira/browse/SPARK-2130 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Reporter: Nicholas Chammas Assignee: Kan Zhang Priority: Minor Fix For: 1.1.0 The [PySpark docs for RDD.getStorageLevel|http://spark.apache.org/docs/1.0.0/api/python/pyspark.rdd.RDD-class.html#getStorageLevel] are unclear. {quote} rdd1 = sc.parallelize([1,2]) rdd1.getStorageLevel() StorageLevel(False, False, False, False, 1) {quote} What do the 5 values of False, False, False, False, 1 mean? -- This message was sent by Atlassian JIRA (v6.2#6252)
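[Editorial note] For reference, the five positional values in that repr correspond (in PySpark of this era, to the best of my understanding; check pyspark.storagelevel for the authoritative order) to useDisk, useMemory, useOffHeap, deserialized, and replication. A small illustrative sketch, not the real class:

```python
from collections import namedtuple

# Hypothetical mirror of PySpark's StorageLevel fields; the real class lives
# in pyspark.storagelevel. The five printed values map to these names.
StorageLevel = namedtuple(
    "StorageLevel",
    ["useDisk", "useMemory", "useOffHeap", "deserialized", "replication"],
)

# The level reported for an RDD that is not persisted anywhere: no disk,
# no memory, no off-heap, not deserialized, a single replica.
level = StorageLevel(False, False, False, False, 1)
print(level.useMemory)    # False: not cached in memory
print(level.replication)  # 1: a single copy
```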
[jira] [Updated] (SPARK-1990) spark-ec2 should only need Python 2.6, not 2.7
[ https://issues.apache.org/jira/browse/SPARK-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1990: --- Assignee: Anant Daksh Asthana spark-ec2 should only need Python 2.6, not 2.7 -- Key: SPARK-1990 URL: https://issues.apache.org/jira/browse/SPARK-1990 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Anant Daksh Asthana Labels: Starter Fix For: 1.0.1, 1.1.0 There were some posts on the lists reporting that spark-ec2 does not work with Python 2.6. In addition, we should check the Python version at the top of the script and exit if it's too old. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1990) spark-ec2 should only need Python 2.6, not 2.7
[ https://issues.apache.org/jira/browse/SPARK-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1990: --- Fix Version/s: 0.9.2 spark-ec2 should only need Python 2.6, not 2.7 -- Key: SPARK-1990 URL: https://issues.apache.org/jira/browse/SPARK-1990 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Anant Daksh Asthana Labels: Starter Fix For: 0.9.2, 1.0.1, 1.1.0 There were some posts on the lists reporting that spark-ec2 does not work with Python 2.6. In addition, we should check the Python version at the top of the script and exit if it's too old. -- This message was sent by Atlassian JIRA (v6.2#6252)
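[Editorial note] The version guard proposed in SPARK-1990 could look something like this (a hedged sketch; the actual spark-ec2 script may word its check differently):

```python
import sys

MIN_VERSION = (2, 6)

def check_python_version():
    # Exit early with a clear message instead of failing later with a
    # confusing SyntaxError or missing-feature error deep in the script.
    if sys.version_info[:2] < MIN_VERSION:
        sys.stderr.write(
            "spark-ec2 requires Python %d.%d or newer; found %s\n"
            % (MIN_VERSION + (sys.version.split()[0],))
        )
        sys.exit(1)

check_python_version()
print("Python version OK")
```

Running the check at the very top of the script is the point: nothing version-dependent executes before it.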
[jira] [Updated] (SPARK-2035) Make a stage's call stack available on the UI
[ https://issues.apache.org/jira/browse/SPARK-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2035: --- Assignee: Daniel Darabos Make a stage's call stack available on the UI - Key: SPARK-2035 URL: https://issues.apache.org/jira/browse/SPARK-2035 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Daniel Darabos Assignee: Daniel Darabos Priority: Minor Fix For: 1.1.0 Attachments: example-html.tgz Currently the stage table displays the file name and line number that is the call site that triggered the given stage. This is enormously useful for understanding the execution. But once a project adds utility classes and other indirections, the call site can become less meaningful, because the interesting line is further up the stack. An idea to fix this is to display the entire call stack that triggered the stage. It would be collapsed by default and could be revealed with a click. I have started working on this. It is a good way to learn about how the RDD interface ties into the UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2035) Make a stage's call stack available on the UI
[ https://issues.apache.org/jira/browse/SPARK-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2035. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 981 [https://github.com/apache/spark/pull/981] Make a stage's call stack available on the UI - Key: SPARK-2035 URL: https://issues.apache.org/jira/browse/SPARK-2035 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Daniel Darabos Priority: Minor Fix For: 1.1.0 Attachments: example-html.tgz Currently the stage table displays the file name and line number that is the call site that triggered the given stage. This is enormously useful for understanding the execution. But once a project adds utility classes and other indirections, the call site can become less meaningful, because the interesting line is further up the stack. An idea to fix this is to display the entire call stack that triggered the stage. It would be collapsed by default and could be revealed with a click. I have started working on this. It is a good way to learn about how the RDD interface ties into the UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2155) Support effectful / non-deterministic key expressions in CASE WHEN statements
[ https://issues.apache.org/jira/browse/SPARK-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2155: Priority: Minor (was: Major) Support effectful / non-deterministic key expressions in CASE WHEN statements - Key: SPARK-2155 URL: https://issues.apache.org/jira/browse/SPARK-2155 Project: Spark Issue Type: Bug Components: SQL Reporter: Zongheng Yang Priority: Minor Currently we translate CASE KEY WHEN to CASE WHEN, hence incurring redundant evaluations of the key expression. Relevant discussions here: https://github.com/apache/spark/pull/1055/files#r13784248 If we really need support for effectful key expressions, we can at least resort to the baseline approach of having both CaseWhen and CaseKeyWhen as expressions, though that seems to introduce much code duplication (e.g. see https://github.com/concretevitamin/spark/blob/47d406a58d129e5bba68bfadf9dd1faa9054d834/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L216 for a sketch implementation). -- This message was sent by Atlassian JIRA (v6.2#6252)
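[Editorial note] The redundant-evaluation problem is easy to demonstrate: translating CASE KEY WHEN v1 ... into CASE WHEN key = v1 ... turns one evaluation of the key into one per branch tried. A hedged Python model (illustrative, not Catalyst code):

```python
calls = {"n": 0}

def key():
    # Stand-in for an effectful / non-deterministic key expression.
    calls["n"] += 1
    return 2

def case_when(branches, default=None):
    # CASE WHEN form: each branch carries its own predicate, so a translated
    # "CASE KEY WHEN" re-evaluates the key expression inside every predicate.
    for predicate, result in branches:
        if predicate():
            return result
    return default

result = case_when([
    (lambda: key() == 1, "one"),
    (lambda: key() == 2, "two"),
])
print(result)      # "two"
print(calls["n"])  # 2: the key ran twice, not once
```

With a non-deterministic key (e.g. rand()), the branches could even compare against different key values, which is why a dedicated CaseKeyWhen expression matters.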
[jira] [Commented] (SPARK-2160) error of Decision tree algorithm in Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033550#comment-14033550 ] Sean Owen commented on SPARK-2160: -- You already added this as https://issues.apache.org/jira/browse/SPARK-2152 right? error of Decision tree algorithm in Spark MLlib -- Key: SPARK-2160 URL: https://issues.apache.org/jira/browse/SPARK-2160 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: caoli Labels: patch Fix For: 1.1.0 Original Estimate: 4h Remaining Estimate: 4h There is an error in computing rightNodeAgg in the decision tree algorithm in Spark MLlib: in the function extractLeftRightNodeAggregates(), the binData index used when computing rightNodeAgg is wrong. Around line 980 of DecisionTree.scala: rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) = binData(shift + (2 * (numBins - 2 - splitIndex))) + rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex)) The index binData(shift + (2 * (numBins - 2 - splitIndex))) is computed incorrectly, so the resulting rightNodeAgg contains repeated bin data. -- This message was sent by Atlassian JIRA (v6.2#6252)
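[Editorial note] The left/right node aggregates discussed here are, in essence, prefix and suffix sums over per-bin statistics; an off-by-one in the suffix indexing double-counts bins on the right side. A simplified sketch of a correct construction (illustrative, not MLlib's actual code):

```python
def left_right_aggregates(bin_counts):
    """For each candidate split index i, left[i] sums bins 0..i and
    right[i] sums bins i+1..end; together they cover every bin exactly once."""
    n = len(bin_counts)
    left = [0] * (n - 1)
    right = [0] * (n - 1)
    acc = 0
    for i in range(n - 1):          # prefix sums for the left child
        acc += bin_counts[i]
        left[i] = acc
    acc = 0
    for i in range(n - 2, -1, -1):  # suffix sums for the right child
        acc += bin_counts[i + 1]    # note i + 1: the bin just past the split
        right[i] = acc
    return left, right

left, right = left_right_aggregates([5, 3, 2, 4])
print(left)   # [5, 8, 10]
print(right)  # [9, 6, 4]
```

A useful invariant check: for every split, left[i] + right[i] equals the total count; a repeated-bin bug like the one reported breaks exactly this invariant.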
[jira] [Updated] (SPARK-2144) SparkUI Executors tab displays incorrect RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2144: --- Component/s: (was: Spark Core) SparkUI Executors tab displays incorrect RDD blocks --- Key: SPARK-2144 URL: https://issues.apache.org/jira/browse/SPARK-2144 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.0.1, 1.1.0 If a block is dropped because of memory pressure, this is not reflected in the RDD Blocks column on the Executors page. This is because StorageStatusListener updates the StorageLevel of the dropped block to StorageLevel.None, but does not remove it from the list. This is a simple fix. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2144) SparkUI Executors tab displays incorrect RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2144: --- Assignee: Andrew Or SparkUI Executors tab displays incorrect RDD blocks --- Key: SPARK-2144 URL: https://issues.apache.org/jira/browse/SPARK-2144 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.0.1, 1.1.0 If a block is dropped because of memory pressure, this is not reflected in the RDD Blocks column on the Executors page. This is because StorageStatusListener updates the StorageLevel of the dropped block to StorageLevel.None, but does not remove it from the list. This is a simple fix. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2144) SparkUI Executors tab displays incorrect RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2144. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1080 [https://github.com/apache/spark/pull/1080] SparkUI Executors tab displays incorrect RDD blocks --- Key: SPARK-2144 URL: https://issues.apache.org/jira/browse/SPARK-2144 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.0.1, 1.1.0 If a block is dropped because of memory pressure, this is not reflected in the RDD Blocks column on the Executors page. This is because StorageStatusListener updates the StorageLevel of the dropped block to StorageLevel.None, but does not remove it from the list. This is a simple fix. -- This message was sent by Atlassian JIRA (v6.2#6252)
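[Editorial note] The fix described for SPARK-2144 amounts to treating a drop to "no storage" as removal rather than an update. A hedged Python sketch of that listener behavior (hypothetical names; not the actual StorageStatusListener code):

```python
class StorageStatusListener:
    """Toy model of per-executor block bookkeeping for the Executors page."""
    def __init__(self):
        self.blocks = {}  # block_id -> storage level

    def on_block_update(self, block_id, level):
        # A block dropped under memory pressure arrives with no storage level;
        # removing it (instead of storing a "None" level) keeps the
        # "RDD Blocks" count on the UI accurate.
        if level is None:
            self.blocks.pop(block_id, None)
        else:
            self.blocks[block_id] = level

listener = StorageStatusListener()
listener.on_block_update("rdd_1_0", "MEMORY_ONLY")
listener.on_block_update("rdd_1_0", None)  # dropped under memory pressure
print(len(listener.blocks))  # 0: the dropped block no longer counts
```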
[jira] [Commented] (SPARK-1353) IllegalArgumentException when writing to disk
[ https://issues.apache.org/jira/browse/SPARK-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033607#comment-14033607 ] Gavin commented on SPARK-1353: -- I have the same problem. spark-assembly-1.0.0-hadoop2.2.0 . IllegalArgumentException when writing to disk - Key: SPARK-1353 URL: https://issues.apache.org/jira/browse/SPARK-1353 Project: Spark Issue Type: Bug Components: Block Manager Environment: AWS EMR 3.2.30-49.59.amzn1.x86_64 #1 SMP x86_64 GNU/Linux Spark 1.0.0-SNAPSHOT built for Hadoop 1.0.4 built 2014-03-18 Reporter: Jim Blomo Priority: Minor The Executor may fail when trying to mmap a file bigger than Integer.MAX_VALUE due to the constraints of FileChannel.map (http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode, long, long)). The signature takes longs, but the size value must be less than MAX_VALUE. This manifests with the following backtrace: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:98) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:337) at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:281) at org.apache.spark.storage.BlockManager.get(BlockManager.scala:430) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:38) at org.apache.spark.rdd.RDD.iterator(RDD.scala:220) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85) -- This message was sent by Atlassian JIRA (v6.2#6252)
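[Editorial note] The 2 GiB ceiling comes from FileChannel.map bounding its size argument by Integer.MAX_VALUE even though the parameter is declared as a long. The usual workaround is to map or read such files in sub-2 GiB chunks; a small illustrative Python sketch of the chunking arithmetic (not Spark's implementation):

```python
MAX_CHUNK = 2**31 - 1  # Integer.MAX_VALUE: the per-mapping ceiling in Java

def chunk_ranges(total_size, max_chunk=MAX_CHUNK):
    """Split [0, total_size) into (offset, length) pairs, each <= max_chunk,
    so every range could be handed to a bounded map/read call."""
    ranges = []
    offset = 0
    while offset < total_size:
        length = min(max_chunk, total_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

# A 5 GiB block needs three mappings under the (2 GiB - 1 byte) ceiling.
ranges = chunk_ranges(5 * 2**30)
print(len(ranges))  # 3
print(all(length <= MAX_CHUNK for _, length in ranges))  # True
```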
[jira] [Created] (SPARK-2163) Change ``setConvergenceTol'' with a parameter of type Double instead of Int
Gang Bai created SPARK-2163: --- Summary: Change ``setConvergenceTol'' with a parameter of type Double instead of Int Key: SPARK-2163 URL: https://issues.apache.org/jira/browse/SPARK-2163 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Gang Bai The class LBFGS in mllib.optimization currently provides a {{setConvergenceTol(tolerance: Int)}} method for setting the convergence tolerance. The tolerance parameter is of type {{Int}}. The specified tolerance is then used as a parameter in calling {{LBFGS.runLBFGS}}, where the parameter {{convergenceTol}} is of type {{Double}}. The Int parameter may cause problems when one creates an optimizer and sets a Double-valued tolerance, e.g.: {code:borderStyle=solid} override val optimizer = new LBFGS(gradient, updater) .setNumCorrections(9) .setConvergenceTol(1e-4) // *type mismatch here* .setMaxNumIterations(100) .setRegParam(1.0) {code} IMHO there is no need to make the tolerance of type Int. Let's change it into a Double parameter and eliminate the type mismatch problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
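[Editorial note] In Scala the Int parameter is rejected at compile time, as shown above; but the underlying hazard is that a fractional tolerance cannot survive an integer parameter at all. A hypothetical Python sketch of the same API mistake, where the loss is silent instead of a compile error (made-up names, not MLlib's classes):

```python
class Optimizer:
    """Toy builder illustrating why an integer tolerance parameter is wrong."""
    def __init__(self):
        self._tol = 1

    def set_convergence_tol(self, tolerance):
        # Mimics a parameter declared as Int: fractional tolerances are lost.
        self._tol = int(tolerance)
        return self

    def set_convergence_tol_fixed(self, tolerance):
        # The proposed fix: accept a floating-point tolerance as-is.
        self._tol = float(tolerance)
        return self

opt = Optimizer()
print(opt.set_convergence_tol(1e-4)._tol)        # 0: tolerance destroyed
print(opt.set_convergence_tol_fixed(1e-4)._tol)  # 0.0001
```

A tolerance of 0 would make the convergence check meaningless, which is why widening the parameter to Double is the right fix.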
[jira] [Updated] (SPARK-2163) Set ``setConvergenceTol'' with a parameter of type Double instead of Int
[ https://issues.apache.org/jira/browse/SPARK-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Bai updated SPARK-2163: Summary: Set ``setConvergenceTol'' with a parameter of type Double instead of Int (was: Change ``setConvergenceTol'' with a parameter of type Double instead of Int) Set ``setConvergenceTol'' with a parameter of type Double instead of Int Key: SPARK-2163 URL: https://issues.apache.org/jira/browse/SPARK-2163 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Gang Bai The class LBFGS in mllib.optimization currently provides a {{setConvergenceTol(tolerance: Int)}} method for setting the convergence tolerance. The tolerance parameter is of type {{Int}}. The specified tolerance is then used as a parameter in calling {{LBFGS.runLBFGS}}, where the parameter {{convergenceTol}} is of type {{Double}}. The Int parameter may cause problems when one creates an optimizer and sets a Double-valued tolerance, e.g.: {code:borderStyle=solid} override val optimizer = new LBFGS(gradient, updater) .setNumCorrections(9) .setConvergenceTol(1e-4) // *type mismatch here* .setMaxNumIterations(100) .setRegParam(1.0) {code} IMHO there is no need to make the tolerance of type Int. Let's change it into a Double parameter and eliminate the type mismatch problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1353) IllegalArgumentException when writing to disk
[ https://issues.apache.org/jira/browse/SPARK-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033625#comment-14033625 ] Mridul Muralidharan commented on SPARK-1353: This is due to a limitation in Spark which is being addressed in https://issues.apache.org/jira/browse/SPARK-1476. IllegalArgumentException when writing to disk - Key: SPARK-1353 URL: https://issues.apache.org/jira/browse/SPARK-1353 Project: Spark Issue Type: Bug Components: Block Manager Environment: AWS EMR 3.2.30-49.59.amzn1.x86_64 #1 SMP x86_64 GNU/Linux Spark 1.0.0-SNAPSHOT built for Hadoop 1.0.4 built 2014-03-18 Reporter: Jim Blomo Priority: Minor The Executor may fail when trying to mmap a file bigger than Integer.MAX_VALUE due to the constraints of FileChannel.map (http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode, long, long)). The signature takes longs, but the size value must be less than MAX_VALUE. This manifests with the following backtrace: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:98) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:337) at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:281) at org.apache.spark.storage.BlockManager.get(BlockManager.scala:430) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:38) at org.apache.spark.rdd.RDD.iterator(RDD.scala:220) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2164) Applying UDF on a struct throws a MatchError
[ https://issues.apache.org/jira/browse/SPARK-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2164. - Resolution: Fixed Fixed by: https://github.com/apache/spark/pull/796 Applying UDF on a struct throws a MatchError Key: SPARK-2164 URL: https://issues.apache.org/jira/browse/SPARK-2164 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Fix For: 1.0.1, 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2164) Applying UDF on a struct throws a MatchError
Michael Armbrust created SPARK-2164: --- Summary: Applying UDF on a struct throws a MatchError Key: SPARK-2164 URL: https://issues.apache.org/jira/browse/SPARK-2164 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Fix For: 1.0.1, 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2053) Add Catalyst expression for CASE WHEN
[ https://issues.apache.org/jira/browse/SPARK-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2053. - Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Add Catalyst expression for CASE WHEN - Key: SPARK-2053 URL: https://issues.apache.org/jira/browse/SPARK-2053 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Fix For: 1.0.1, 1.1.0 Here's a rough start: https://github.com/marmbrus/spark/commit/1209daaf49b0a87e7f68f89c79d02b446e624db3 -- This message was sent by Atlassian JIRA (v6.2#6252)
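[Editorial note] A CASE WHEN expression of the kind being added here is commonly modeled as a flat sequence of alternating conditions and values with an optional trailing else. A simplified Python model of that evaluation (illustrative only; see the linked commit for the actual Catalyst expression):

```python
def eval_case_when(branches, else_value=None):
    """branches: flat list [cond1, val1, cond2, val2, ...], mirroring how a
    CASE WHEN expression's children can be laid out in an expression tree.
    The first branch whose condition holds wins; otherwise the else value."""
    assert len(branches) % 2 == 0, "conditions and values must pair up"
    for i in range(0, len(branches), 2):
        if branches[i]:
            return branches[i + 1]
    return else_value

x = 7
result = eval_case_when([x < 5, "small", x < 10, "medium"], "large")
print(result)  # "medium"
```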
[jira] [Commented] (SPARK-2163) Set ``setConvergenceTol'' with a parameter of type Double instead of Int
[ https://issues.apache.org/jira/browse/SPARK-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033703#comment-14033703 ] Gang Bai commented on SPARK-2163: - I've created a pull request on GitHub for this. https://github.com/apache/spark/pull/1104 Set ``setConvergenceTol'' with a parameter of type Double instead of Int Key: SPARK-2163 URL: https://issues.apache.org/jira/browse/SPARK-2163 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Gang Bai The class LBFGS in mllib.optimization currently provides a {{setConvergenceTol(tolerance: Int)}} method for setting the convergence tolerance. The tolerance parameter is of type {{Int}}. The specified tolerance is then used as parameter in calling {{LBFGS.runLBFGS}}, where the parameter {{convergenceTol}} is of type {{Double}}. The Int parameter may cause problem when one creates an optimizer and sets a Double-valued tolerance. e.g: {code:borderStyle=solid} override val optimizer = new LBFGS(gradient, updater) .setNumCorrections(9) .setConvergenceTol(1e-4) // *type mismatch here* .setMaxNumIterations(100) .setRegParam(1.0) {code} IMHO there is no need to make the tolerance of type Int. Let's change it into a Double parameter and eliminate the type mismatch problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1471) Worker not recognize Driver state at standalone mode
[ https://issues.apache.org/jira/browse/SPARK-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033938#comment-14033938 ] Federico Ragona commented on SPARK-1471: Hello, I'm facing the same issue in version 1.0.0 (built from the sources distribution using {{make-distribution.sh --hadoop 2.0.0-cdh4.7.0}}). I'm running a job using the new {{bin/spark-submit}} script. When the job fails, one of the workers dies with the following error: {code} 2014-06-17 17:00:04,675 [sparkWorker-akka.actor.default-dispatcher-3] ERROR akka.actor.OneForOneStrategy - FAILED (of class scala.Enumeration$Val) scala.MatchError: FAILED (of class scala.Enumeration$Val) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:317) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} Worker not recognize Driver state at standalone mode - Key: SPARK-1471 URL: https://issues.apache.org/jira/browse/SPARK-1471 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Environment: standalone Reporter: shenhong When I run a spark job in standalone, ./bin/spark-class org.apache.spark.deploy.Client launch spark://v125050024.bja:7077 file:///home/yuling.sh/spark-0.9.0-incubating/examples/target/spark-examples_2.10-0.9.0-incubating.jar org.apache.spark.examples.SparkPi Here is the Worker log. 
14/04/11 11:15:04 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val) scala.MatchError: FAILED (of class scala.Enumeration$Val) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.2#6252)
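[Editorial note] The scala.MatchError above arises because the Worker's receive handler pattern-matches on driver state changes without covering every state value (here FAILED). A loose Python analogy using dict dispatch, where a missing state raises just as a non-exhaustive match does (hypothetical state names, not the actual DriverState enumeration):

```python
handlers = {
    "FINISHED": lambda: "cleanup",
    "KILLED": lambda: "cleanup",
    "ERROR": lambda: "cleanup",
    # "FAILED" missing: the analogue of the non-exhaustive match
}

def on_state_changed(state):
    # Raises KeyError for FAILED, much like scala.MatchError in the Worker.
    return handlers[state]()

def on_state_changed_fixed(state):
    # The fix: handle every terminal state (or add a catch-all branch).
    return handlers.get(state, lambda: "cleanup")()

try:
    on_state_changed("FAILED")
except KeyError as exc:
    print("unhandled state:", exc)

print(on_state_changed_fixed("FAILED"))  # cleanup
```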
[jira] [Commented] (SPARK-1471) Worker not recognize Driver state at standalone mode
[ https://issues.apache.org/jira/browse/SPARK-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033984#comment-14033984 ] Nan Zhu commented on SPARK-1471: I will fix it right now Worker not recognize Driver state at standalone mode - Key: SPARK-1471 URL: https://issues.apache.org/jira/browse/SPARK-1471 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Environment: standalone Reporter: shenhong When I run a spark job in standalone, ./bin/spark-class org.apache.spark.deploy.Client launch spark://v125050024.bja:7077 file:///home/yuling.sh/spark-0.9.0-incubating/examples/target/spark-examples_2.10-0.9.0-incubating.jar org.apache.spark.examples.SparkPi Here is the Worker log. 14/04/11 11:15:04 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val) scala.MatchError: FAILED (of class scala.Enumeration$Val) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1471) Worker not recognize Driver state at standalone mode
[ https://issues.apache.org/jira/browse/SPARK-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033986#comment-14033986 ] Nan Zhu commented on SPARK-1471: this has been fixed by https://github.com/apache/spark/commit/95e4c9c6fb153b7f0aa4c442c4bdb6552d326640 Worker not recognize Driver state at standalone mode - Key: SPARK-1471 URL: https://issues.apache.org/jira/browse/SPARK-1471 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Environment: standalone Reporter: shenhong When I run a spark job in standalone, ./bin/spark-class org.apache.spark.deploy.Client launch spark://v125050024.bja:7077 file:///home/yuling.sh/spark-0.9.0-incubating/examples/target/spark-examples_2.10-0.9.0-incubating.jar org.apache.spark.examples.SparkPi Here is the Worker log. 14/04/11 11:15:04 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val) scala.MatchError: FAILED (of class scala.Enumeration$Val) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1471) Worker not recognize Driver state at standalone mode
[ https://issues.apache.org/jira/browse/SPARK-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li resolved SPARK-1471. Resolution: Fixed Worker not recognize Driver state at standalone mode - Key: SPARK-1471 URL: https://issues.apache.org/jira/browse/SPARK-1471 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Environment: standalone Reporter: shenhong When I run a spark job in standalone, ./bin/spark-class org.apache.spark.deploy.Client launch spark://v125050024.bja:7077 file:///home/yuling.sh/spark-0.9.0-incubating/examples/target/spark-examples_2.10-0.9.0-incubating.jar org.apache.spark.examples.SparkPi Here is the Worker log. 14/04/11 11:15:04 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val) scala.MatchError: FAILED (of class scala.Enumeration$Val) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1471) Worker not recognize Driver state at standalone mode
[ https://issues.apache.org/jira/browse/SPARK-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1471: --- Fix Version/s: 1.0.0 Worker not recognize Driver state at standalone mode - Key: SPARK-1471 URL: https://issues.apache.org/jira/browse/SPARK-1471 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Environment: standalone Reporter: shenhong Fix For: 1.0.1 When I run a spark job in standalone, ./bin/spark-class org.apache.spark.deploy.Client launch spark://v125050024.bja:7077 file:///home/yuling.sh/spark-0.9.0-incubating/examples/target/spark-examples_2.10-0.9.0-incubating.jar org.apache.spark.examples.SparkPi Here is the Worker log. 14/04/11 11:15:04 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val) scala.MatchError: FAILED (of class scala.Enumeration$Val) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1471) Worker not recognize Driver state at standalone mode
[ https://issues.apache.org/jira/browse/SPARK-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1471: --- Fix Version/s: (was: 1.0.0) 1.0.1 Worker not recognize Driver state at standalone mode - Key: SPARK-1471 URL: https://issues.apache.org/jira/browse/SPARK-1471 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Environment: standalone Reporter: shenhong Fix For: 1.0.1 When I run a spark job in standalone, ./bin/spark-class org.apache.spark.deploy.Client launch spark://v125050024.bja:7077 file:///home/yuling.sh/spark-0.9.0-incubating/examples/target/spark-examples_2.10-0.9.0-incubating.jar org.apache.spark.examples.SparkPi Here is the Worker log. 14/04/11 11:15:04 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val) scala.MatchError: FAILED (of class scala.Enumeration$Val) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1471) Worker not recognize Driver state at standalone mode
[ https://issues.apache.org/jira/browse/SPARK-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1471: --- Fix Version/s: 1.1.0 Worker not recognize Driver state at standalone mode - Key: SPARK-1471 URL: https://issues.apache.org/jira/browse/SPARK-1471 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Environment: standalone Reporter: shenhong Fix For: 1.0.1, 1.1.0 When I run a spark job in standalone, ./bin/spark-class org.apache.spark.deploy.Client launch spark://v125050024.bja:7077 file:///home/yuling.sh/spark-0.9.0-incubating/examples/target/spark-examples_2.10-0.9.0-incubating.jar org.apache.spark.examples.SparkPi Here is the Worker log. 14/04/11 11:15:04 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val) scala.MatchError: FAILED (of class scala.Enumeration$Val) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2058) SPARK_CONF_DIR should override all present configs
[ https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034036#comment-14034036 ] Ryan Fishel commented on SPARK-2058: Was Eugen's fix implemented? SPARK_CONF_DIR should override all present configs -- Key: SPARK-2058 URL: https://issues.apache.org/jira/browse/SPARK-2058 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0, 1.0.1, 1.1.0 Reporter: Eugen Cepoi Priority: Trivial Fix For: 1.0.1, 1.1.0 When the user defines SPARK_CONF_DIR, I think Spark should use all the configs available there, not only spark-env. This involves changing SparkSubmitArguments to first read from SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the computed classpath for configs such as log4j, metrics, etc. I have already prepared a PR for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
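The lookup order the issue proposes can be sketched as follows. This is a hypothetical helper, not the actual SparkSubmitArguments code; the file name and fallback to SPARK_HOME/conf are assumptions based on the description.

```scala
import java.io.File

// Hypothetical sketch of the proposed resolution order: prefer SPARK_CONF_DIR
// when the user has set it, otherwise fall back to SPARK_HOME/conf.
def defaultPropertiesFile(env: Map[String, String]): Option[File] = {
  env.get("SPARK_CONF_DIR")
    .orElse(env.get("SPARK_HOME").map(_ + File.separator + "conf"))
    .map(dir => new File(dir, "spark-defaults.conf"))
    .filter(_.isFile) // only return a file that actually exists
}
```

The same precedence would also have to be applied in the shell scripts that build the classpath, so log4j and metrics configs in SPARK_CONF_DIR win as well.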
[jira] [Updated] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell
[ https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1199: --- Fix Version/s: (was: 1.1.0) Type mismatch in Spark shell when using case class defined in shell --- Key: SPARK-1199 URL: https://issues.apache.org/jira/browse/SPARK-1199 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Andrew Kerr Assignee: Prashant Sharma Priority: Blocker Define a class in the shell: {code} case class TestClass(a:String) {code} and an RDD {code} val data = sc.parallelize(Seq("a")).map(TestClass(_)) {code} define a function on it and map over the RDD {code} def itemFunc(a:TestClass):TestClass = a data.map(itemFunc) {code} Error: {code} <console>:19: error: type mismatch; found : TestClass => TestClass required: TestClass => ? data.map(itemFunc) {code} Similarly with a mapPartitions: {code} def partitionFunc(a:Iterator[TestClass]):Iterator[TestClass] = a data.mapPartitions(partitionFunc) {code} {code} <console>:19: error: type mismatch; found : Iterator[TestClass] => Iterator[TestClass] required: Iterator[TestClass] => Iterator[?] Error occurred in an application involving default arguments. data.mapPartitions(partitionFunc) {code} The behavior is the same whether in local mode or on a cluster. This isn't specific to RDDs. A Scala collection in the Spark shell has the same problem. {code} scala> Seq(TestClass("foo")).map(itemFunc) <console>:15: error: type mismatch; found : TestClass => TestClass required: TestClass => ? Seq(TestClass("foo")).map(itemFunc) ^ {code} When run in the Scala console (not the Spark shell) there are no type mismatch errors. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell
[ https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reassigned SPARK-1199: -- Assignee: Prashant Sharma Prashant said he could look into this - so I'm assigning it to him. Type mismatch in Spark shell when using case class defined in shell --- Key: SPARK-1199 URL: https://issues.apache.org/jira/browse/SPARK-1199 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Andrew Kerr Assignee: Prashant Sharma Priority: Blocker Define a class in the shell: {code} case class TestClass(a:String) {code} and an RDD {code} val data = sc.parallelize(Seq("a")).map(TestClass(_)) {code} define a function on it and map over the RDD {code} def itemFunc(a:TestClass):TestClass = a data.map(itemFunc) {code} Error: {code} <console>:19: error: type mismatch; found : TestClass => TestClass required: TestClass => ? data.map(itemFunc) {code} Similarly with a mapPartitions: {code} def partitionFunc(a:Iterator[TestClass]):Iterator[TestClass] = a data.mapPartitions(partitionFunc) {code} {code} <console>:19: error: type mismatch; found : Iterator[TestClass] => Iterator[TestClass] required: Iterator[TestClass] => Iterator[?] Error occurred in an application involving default arguments. data.mapPartitions(partitionFunc) {code} The behavior is the same whether in local mode or on a cluster. This isn't specific to RDDs. A Scala collection in the Spark shell has the same problem. {code} scala> Seq(TestClass("foo")).map(itemFunc) <console>:15: error: type mismatch; found : TestClass => TestClass required: TestClass => ? Seq(TestClass("foo")).map(itemFunc) ^ {code} When run in the Scala console (not the Spark shell) there are no type mismatch errors. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell
[ https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1199: --- Target Version/s: 1.0.1, 1.1.0 Type mismatch in Spark shell when using case class defined in shell --- Key: SPARK-1199 URL: https://issues.apache.org/jira/browse/SPARK-1199 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Andrew Kerr Assignee: Prashant Sharma Priority: Blocker Define a class in the shell: {code} case class TestClass(a:String) {code} and an RDD {code} val data = sc.parallelize(Seq("a")).map(TestClass(_)) {code} define a function on it and map over the RDD {code} def itemFunc(a:TestClass):TestClass = a data.map(itemFunc) {code} Error: {code} <console>:19: error: type mismatch; found : TestClass => TestClass required: TestClass => ? data.map(itemFunc) {code} Similarly with a mapPartitions: {code} def partitionFunc(a:Iterator[TestClass]):Iterator[TestClass] = a data.mapPartitions(partitionFunc) {code} {code} <console>:19: error: type mismatch; found : Iterator[TestClass] => Iterator[TestClass] required: Iterator[TestClass] => Iterator[?] Error occurred in an application involving default arguments. data.mapPartitions(partitionFunc) {code} The behavior is the same whether in local mode or on a cluster. This isn't specific to RDDs. A Scala collection in the Spark shell has the same problem. {code} scala> Seq(TestClass("foo")).map(itemFunc) <console>:15: error: type mismatch; found : TestClass => TestClass required: TestClass => ? Seq(TestClass("foo")).map(itemFunc) ^ {code} When run in the Scala console (not the Spark shell) there are no type mismatch errors. -- This message was sent by Atlassian JIRA (v6.2#6252)
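For reference, the reported snippet in plain Scala (outside the wrapper classes the Spark shell generates) compiles and runs fine, whether the function is passed directly or eta-expanded at the call site. This is an illustration of the expected behavior, not a confirmed fix for the shell:

```scala
// Plain-Scala version of the snippet from the report. Outside the 0.9 Spark
// shell both forms compile; the explicit lambda is the form sometimes used to
// work around shell-wrapper type mismatches.
case class TestClass(a: String)

def itemFunc(a: TestClass): TestClass = a

val direct   = Seq(TestClass("foo")).map(itemFunc)         // direct reference
val expanded = Seq(TestClass("foo")).map(x => itemFunc(x)) // eta-expanded form
```

Both produce `Seq(TestClass("foo"))`, confirming the bug is in how the shell wraps definitions, not in the user code itself.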
[jira] [Commented] (SPARK-2157) Can't write tight firewall rules for Spark
[ https://issues.apache.org/jira/browse/SPARK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034135#comment-14034135 ] Andrew Ash commented on SPARK-2157: --- I pulled together Egor's work for HttpBroadcast and HttpFileServer and added configuration options for the block manager and the repl class server in this PR: https://github.com/apache/spark/pull/1107 Can't write tight firewall rules for Spark -- Key: SPARK-2157 URL: https://issues.apache.org/jira/browse/SPARK-2157 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Priority: Critical In order to run Spark in places with strict firewall rules, you need to be able to specify every port that's used between all parts of the stack. Per the [network activity section of the docs|http://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security] most of the ports are configurable, but there are a few ports that aren't configurable. We need to make every port configurable to a particular port, so that we can run Spark in highly locked-down environments. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2165) spark on yarn: add support for setting maxAppAttempts in the ApplicationSubmissionContext
Thomas Graves created SPARK-2165: Summary: spark on yarn: add support for setting maxAppAttempts in the ApplicationSubmissionContext Key: SPARK-2165 URL: https://issues.apache.org/jira/browse/SPARK-2165 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.0 Reporter: Thomas Graves Hadoop 2.x adds support for allowing the application to specify the maximum application attempts. We should add support for it by setting it in the ApplicationSubmissionContext. -- This message was sent by Atlassian JIRA (v6.2#6252)
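A sketch of what the resolution logic could look like. The config key name (`spark.yarn.maxAppAttempts`) and the clamping against YARN's global `yarn.resourcemanager.am.max-attempts` limit are assumptions; the real change would ultimately pass the resolved value to `ApplicationSubmissionContext.setMaxAppAttempts` on Hadoop 2.x.

```scala
// Hypothetical helper deciding what to hand to
// ApplicationSubmissionContext.setMaxAppAttempts. Both the key name and the
// clamping rule are assumptions, not the final design.
def resolveMaxAppAttempts(requested: Option[Int], yarnGlobalMax: Int): Int =
  requested match {
    case Some(n) => math.min(math.max(n, 1), yarnGlobalMax) // stay within YARN's limit
    case None    => yarnGlobalMax                           // keep YARN's default
  }
```

YARN silently caps per-application attempts at the global maximum anyway, so clamping on the Spark side mainly makes the effective value visible to the user.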
[jira] [Created] (SPARK-2166) Enumerating instances to be terminated before prompting the users to continue.
Jean-Martin Archer created SPARK-2166: - Summary: Enumerating instances to be terminated before prompting the users to continue. Key: SPARK-2166 URL: https://issues.apache.org/jira/browse/SPARK-2166 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.9.0 Reporter: Jean-Martin Archer Priority: Minor When destroying a cluster, the user will be prompted for confirmation without first being shown which instances will be terminated. Pull Request: https://github.com/apache/spark/pull/270#issuecomment-46341975 This pull request will list the EC2 instances before destroying the cluster. This was added because it can be scary to destroy EC2 instances without knowing which ones will be affected. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1907) spark-submit: add exec at the end of the script
[ https://issues.apache.org/jira/browse/SPARK-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-1907: - Assignee: Colin Patrick McCabe spark-submit: add exec at the end of the script --- Key: SPARK-1907 URL: https://issues.apache.org/jira/browse/SPARK-1907 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Add an 'exec' at the end of the spark-submit script, to avoid keeping a bash process hanging around while it runs. This makes ps look a little bit nicer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2167) spark-submit should return exit code based on failure/success
Thomas Graves created SPARK-2167: Summary: spark-submit should return exit code based on failure/success Key: SPARK-2167 URL: https://issues.apache.org/jira/browse/SPARK-2167 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Thomas Graves Fix For: 1.1.0 The spark-submit script and Java class should exit with 0 on success and non-zero on failure so that other command line tools and workflow managers (like Oozie) can properly tell whether the Spark app succeeded or failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
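The requested contract can be sketched in a few lines. This is an illustration of the behavior being asked for, not Spark's actual submission code:

```scala
// Sketch (not Spark's real code) of the requested contract: run the
// application body and map success/failure to a process exit code that
// workflow managers such as Oozie can inspect.
def exitCodeFor(body: () => Unit): Int =
  try {
    body()
    0 // success
  } catch {
    case e: Throwable =>
      System.err.println(s"Application failed: ${e.getMessage}")
      1 // non-zero so callers can detect the failure
  }
```

The launcher would then call `System.exit(exitCodeFor(...))` so the shell script's own exit status propagates the result.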
[jira] [Resolved] (SPARK-1907) spark-submit: add exec at the end of the script
[ https://issues.apache.org/jira/browse/SPARK-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1907. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 spark-submit: add exec at the end of the script --- Key: SPARK-1907 URL: https://issues.apache.org/jira/browse/SPARK-1907 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Priority: Minor Fix For: 1.0.1, 1.1.0 Add an 'exec' at the end of the spark-submit script, to avoid keeping a bash process hanging around while it runs. This makes ps look a little bit nicer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true
[ https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034294#comment-14034294 ] Sebastien Rainville commented on SPARK-2022: I'm seeing the same behavior when trying to set spark.executor.extraLibraryPath: in conf/spark-defaults.conf: spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native the error message in stderr: WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:00:55.592289 27091 fetcher.cpp:73] Fetching URI 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:00:55.592428 27091 fetcher.cpp:99] Downloading resource from 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' to '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:01:05.170714 27091 fetcher.cpp:61] Extracted resource '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' into '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2' WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:01:06.105363 27166 exec.cpp:131] Version: 0.18.0 I0617 16:01:06.112191 27175 exec.cpp:205] Executor registered on slave 201311011608-1369465866-5050-9189-86 Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NumberFormatException: For input string: ca1-dcc1-0106.lab.mtl at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) here is the command in stdout: Registered executor on ca1-dcc1-0106.lab.mtl Starting task 9 Forked command at 27178 sh -c 'cd spark-1*; ./bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend -Djava.library.path=/usr/lib/hadoop/lib/native akka.tcp://sp...@ca1-dcc1-0071.lab.mtl:32789/user/CoarseGrainedScheduler 201311011608-1369465866-5050-9189-86 ca1-dcc1-0106.lab.mtl 1' Command exited with status 1 (pid: 27178) Spark 1.0.0 is failing if mesos.coarse set to true -- Key: SPARK-2022 URL: https://issues.apache.org/jira/browse/SPARK-2022 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Marek Wiewiorka Priority: Critical more stderr --- WARNING: Logging before InitGoogleLogging() is written to STDERR I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2 I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave 201405220917-134217738-5050-27119-0 Exception in thread main java.lang.NumberFormatException: For input string: sparkseq003.cloudapp.net at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) more stdout --- Registered executor on sparkseq003.cloudapp.net Starting task 5 Forked command at 61202 sh -c 
'/home/mesos/spark-1.0.0/bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend -Dspark.mesos.coarse=true akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseG rainedScheduler 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 4' Command exited with status 1 (pid: 61202) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true
[ https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034294#comment-14034294 ] Sebastien Rainville edited comment on SPARK-2022 at 6/17/14 8:08 PM: - I'm seeing the same behavior when trying to set spark.executor.extraLibraryPath: in conf/spark-defaults.conf: {noformat} spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native {noformat} the error message in stderr: {noformat} WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:00:55.592289 27091 fetcher.cpp:73] Fetching URI 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:00:55.592428 27091 fetcher.cpp:99] Downloading resource from 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' to '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:01:05.170714 27091 fetcher.cpp:61] Extracted resource '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' into '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2' WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:01:06.105363 27166 exec.cpp:131] Version: 0.18.0 I0617 16:01:06.112191 27175 exec.cpp:205] Executor registered on slave 201311011608-1369465866-5050-9189-86 Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NumberFormatException: For input string: ca1-dcc1-0106.lab.mtl at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at 
java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) {noformat} here is the command in stdout: {noformat} Registered executor on ca1-dcc1-0106.lab.mtl Starting task 9 Forked command at 27178 sh -c 'cd spark-1*; ./bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend -Djava.library.path=/usr/lib/hadoop/lib/native akka.tcp://sp...@ca1-dcc1-0071.lab.mtl:32789/user/CoarseGrainedScheduler 201311011608-1369465866-5050-9189-86 ca1-dcc1-0106.lab.mtl 1' Command exited with status 1 (pid: 27178) {noformat} was (Author: srainville): I'm seeing the same behavior when trying to set spark.executor.extraLibraryPath: in conf/spark-defaults.conf: {noformat} spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native {noformat} the error message in stderr: WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:00:55.592289 27091 fetcher.cpp:73] Fetching URI 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:00:55.592428 27091 fetcher.cpp:99] Downloading resource from 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' to '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:01:05.170714 27091 fetcher.cpp:61] Extracted resource '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' into 
'/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2' WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:01:06.105363 27166 exec.cpp:131] Version: 0.18.0 I0617 16:01:06.112191 27175 exec.cpp:205] Executor registered on slave 201311011608-1369465866-5050-9189-86 Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NumberFormatException: For input string: ca1-dcc1-0106.lab.mtl at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at
[jira] [Comment Edited] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true
[ https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034294#comment-14034294 ] Sebastien Rainville edited comment on SPARK-2022 at 6/17/14 8:08 PM: - I'm seeing the same behavior when trying to set spark.executor.extraLibraryPath: in conf/spark-defaults.conf: {noformat} spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native {noformat} the error message in stderr: WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:00:55.592289 27091 fetcher.cpp:73] Fetching URI 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:00:55.592428 27091 fetcher.cpp:99] Downloading resource from 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' to '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:01:05.170714 27091 fetcher.cpp:61] Extracted resource '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' into '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2' WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:01:06.105363 27166 exec.cpp:131] Version: 0.18.0 I0617 16:01:06.112191 27175 exec.cpp:205] Executor registered on slave 201311011608-1369465866-5050-9189-86 Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NumberFormatException: For input string: ca1-dcc1-0106.lab.mtl at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at 
java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) here is the command in stdout: Registered executor on ca1-dcc1-0106.lab.mtl Starting task 9 Forked command at 27178 sh -c 'cd spark-1*; ./bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend -Djava.library.path=/usr/lib/hadoop/lib/native akka.tcp://sp...@ca1-dcc1-0071.lab.mtl:32789/user/CoarseGrainedScheduler 201311011608-1369465866-5050-9189-86 ca1-dcc1-0106.lab.mtl 1' Command exited with status 1 (pid: 27178) was (Author: srainville): I'm seeing the same behavior when trying to set spark.executor.extraLibraryPath: in conf/spark-defaults.conf: spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native the error message in stderr: WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:00:55.592289 27091 fetcher.cpp:73] Fetching URI 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:00:55.592428 27091 fetcher.cpp:99] Downloading resource from 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' to '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:01:05.170714 27091 fetcher.cpp:61] Extracted resource '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' into 
'/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2' WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:01:06.105363 27166 exec.cpp:131] Version: 0.18.0 I0617 16:01:06.112191 27175 exec.cpp:205] Executor registered on slave 201311011608-1369465866-5050-9189-86 Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NumberFormatException: For input string: ca1-dcc1-0106.lab.mtl at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at
[jira] [Updated] (SPARK-2166) Enumerating instances to be terminated before prompting the users to continue.
[ https://issues.apache.org/jira/browse/SPARK-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2166: --- Assignee: Jean-Martin Archer Enumerating instances to be terminated before prompting the users to continue. -- Key: SPARK-2166 URL: https://issues.apache.org/jira/browse/SPARK-2166 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.9.0, 1.0.0 Reporter: Jean-Martin Archer Assignee: Jean-Martin Archer Priority: Minor Original Estimate: 0h Remaining Estimate: 0h When destroying a cluster, the user will be prompted for confirmation without first being shown which instances will be terminated. Pull Request: https://github.com/apache/spark/pull/270#issuecomment-46341975 This pull request will list the EC2 instances before destroying the cluster. This was added because it can be scary to destroy EC2 instances without knowing which ones will be affected. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2166) Enumerating instances to be terminated before prompting the users to continue.
[ https://issues.apache.org/jira/browse/SPARK-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2166: --- Affects Version/s: 1.0.0 Enumerating instances to be terminated before prompting the users to continue. -- Key: SPARK-2166 URL: https://issues.apache.org/jira/browse/SPARK-2166 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.9.0, 1.0.0 Reporter: Jean-Martin Archer Priority: Minor Original Estimate: 0h Remaining Estimate: 0h When destroying a cluster, the user will be prompted for confirmation without first being shown which instances will be terminated. Pull Request: https://github.com/apache/spark/pull/270#issuecomment-46341975 This pull request will list the EC2 instances before destroying the cluster. This was added because it can be scary to destroy EC2 instances without knowing which ones will be affected. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2166) Enumerating instances to be terminated before prompting the users to continue.
[ https://issues.apache.org/jira/browse/SPARK-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2166: --- Target Version/s: 1.1.0 Enumerating instances to be terminated before prompting the users to continue. -- Key: SPARK-2166 URL: https://issues.apache.org/jira/browse/SPARK-2166 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.9.0, 1.0.0 Reporter: Jean-Martin Archer Assignee: Jean-Martin Archer Priority: Minor Original Estimate: 0h Remaining Estimate: 0h When destroying a cluster, the user will be prompted for confirmation without first being shown which instances will be terminated. Pull Request: https://github.com/apache/spark/pull/270#issuecomment-46341975 This pull request will list the EC2 instances before destroying the cluster. This was added because it can be scary to destroy EC2 instances without knowing which ones will be affected. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true
[ https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034294#comment-14034294 ] Sebastien Rainville edited comment on SPARK-2022 at 6/17/14 8:41 PM: - I'm seeing the same behavior when trying to set spark.executor.extraLibraryPath: in conf/spark-defaults.conf: {noformat} spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native {noformat} the error message in stderr: {noformat} WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:00:55.592289 27091 fetcher.cpp:73] Fetching URI 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:00:55.592428 27091 fetcher.cpp:99] Downloading resource from 'hdfs://ca1-dcc1-0071:9200/user/sebastien/spark-1.0.0-bin-cdh4-sebr.tgz' to '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' I0617 16:01:05.170714 27091 fetcher.cpp:61] Extracted resource '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2/spark-1.0.0-bin-cdh4-sebr.tgz' into '/u05/app/mesos/work/slaves/201311011608-1369465866-5050-9189-86/frameworks/20140416-011500-1369465866-5050-26096-0449/executors/9/runs/ba87d7b6-56c1-4892-9ed8-18fa8f8364d2' WARNING: Logging before InitGoogleLogging() is written to STDERR I0617 16:01:06.105363 27166 exec.cpp:131] Version: 0.18.0 I0617 16:01:06.112191 27175 exec.cpp:205] Executor registered on slave 201311011608-1369465866-5050-9189-86 Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NumberFormatException: For input string: ca1-dcc1-0106.lab.mtl at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at 
java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) {noformat} here is the command in stdout: {noformat} Registered executor on ca1-dcc1-0106.lab.mtl Starting task 9 Forked command at 27178 sh -c 'cd spark-1*; ./bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend -Djava.library.path=/usr/lib/hadoop/lib/native akka.tcp://sp...@ca1-dcc1-0071.lab.mtl:32789/user/CoarseGrainedScheduler 201311011608-1369465866-5050-9189-86 ca1-dcc1-0106.lab.mtl 1' Command exited with status 1 (pid: 27178) {noformat} In fact, this behavior occurs whenever a JVM arg is set, so setting spark.executor.extraJavaOptions triggers it too. The problem is that CoarseMesosSchedulerBackend passes the JVM args to CoarseGrainedExecutorBackend instead of to the JVM itself: {code}
val uri = conf.get("spark.executor.uri", null)
if (uri == null) {
  val runScript = new File(sparkHome, "./bin/spark-class").getCanonicalPath
  command.setValue(
    "\"%s\" org.apache.spark.executor.CoarseGrainedExecutorBackend %s %s %s %s %d".format(
      runScript, extraOpts, driverUrl, offer.getSlaveId.getValue, offer.getHostname, numCores))
} else {
  // Grab everything to the first '.'. We'll use that and '*' to
  // glob the directory correctly.
  val basename = uri.split('/').last.split('.').head
  command.setValue(
    ("cd %s*; " +
      "./bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend %s %s %s %s %d")
      .format(basename, extraOpts, driverUrl, offer.getSlaveId.getValue, offer.getHostname, numCores))
  command.addUris(CommandInfo.URI.newBuilder().setValue(uri))
}
{code} as a reference, here's the main method in CoarseGrainedExecutorBackend: {code}
def main(args: Array[String]) {
  args.length match {
    case x if x < 4 =>
      System.err.println(
        // Worker url is used in spark standalone mode to enforce fate-sharing with worker
        "Usage: CoarseGrainedExecutorBackend <driverUrl> <executorId> <hostname> " +
          "<cores> [<workerUrl>]")
      System.exit(1)
    case 4 =>
      run(args(0), args(1), args(2), args(3).toInt, None)
    case x if x > 4 =>
      run(args(0), args(1), args(2), args(3).toInt, Some(args(4)))
  }
}
{code} was (Author: srainville): I'm seeing the same behavior when trying to set spark.executor.extraLibraryPath: in conf/spark-defaults.conf: {noformat} spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native {noformat} the error message in
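The failure mode can be sketched outside Spark (hypothetical function and variable names, not Spark's actual code): the backend expects a fixed positional argv of `<driverUrl> <executorId> <hostname> <cores>`, so splicing a JVM flag in front of it shifts every field right by one, and the cores slot no longer parses as an integer — the Python analogue of the `NumberFormatException` on the hostname above.

```python
def parse_backend_args(argv):
    # Expected layout: <driverUrl> <executorId> <hostname> <cores> [<workerUrl>]
    if len(argv) < 4:
        raise SystemExit("Usage: <driverUrl> <executorId> <hostname> <cores> [<workerUrl>]")
    driver_url, executor_id, hostname = argv[0], argv[1], argv[2]
    cores = int(argv[3])  # the 4th slot must be an integer
    worker_url = argv[4] if len(argv) > 4 else None
    return driver_url, executor_id, hostname, cores, worker_url

good = ["akka.tcp://spark@driver:32789/user/CoarseGrainedScheduler",
        "201311011608-1369465866-5050-9189-86", "ca1-dcc1-0106.lab.mtl", "1"]

# Passing the JVM option as a program argument shifts everything right by one,
# so int() now sees the hostname string instead of the core count:
bad = ["-Djava.library.path=/usr/lib/hadoop/lib/native"] + good
```

With `bad`, `int(argv[3])` raises `ValueError` on `"ca1-dcc1-0106.lab.mtl"`, which is exactly what `args(3).toInt` does in the Scala backend.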
[jira] [Created] (SPARK-2169) SparkUI.setAppName() has no effect
Marcelo Vanzin created SPARK-2169: - Summary: SparkUI.setAppName() has no effect Key: SPARK-2169 URL: https://issues.apache.org/jira/browse/SPARK-2169 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Marcelo Vanzin {{SparkUI.setAppName()}} does not do anything useful. It overwrites the instance's {{appName}} field, but all places where that field is used have already read its value into their own copies by the time that happens. E.g. StagePage.scala copies {{parent.appName}} into its own private {{appName}} in the constructor, which is called as part of SparkUI's constructor. So when you call {{SparkUI.setAppName}} it does not overwrite StagePage's copy, and the UI still shows the old value. -- This message was sent by Atlassian JIRA (v6.2#6252)
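A stripped-down sketch of why the setter is a no-op (hypothetical classes mimicking the report, not Spark's actual code): the child snapshots the parent's name during construction, so later writes to the parent's field never reach the copy.

```python
class StagePageLike:
    """Mimics StagePage: copies the parent's appName in its constructor."""
    def __init__(self, parent):
        self.app_name = parent.app_name  # a snapshot, not a live reference

class SparkUILike:
    """Mimics SparkUI: builds its pages eagerly in its own constructor."""
    def __init__(self, app_name):
        self.app_name = app_name
        self.stage_page = StagePageLike(self)  # the copy is taken here

    def set_app_name(self, name):
        # Overwrites only this instance's field; the page already holds
        # its own copy, so the rendered UI keeps showing the old value.
        self.app_name = name

ui = SparkUILike("old-name")
ui.set_app_name("new-name")
# ui.app_name is now "new-name", but ui.stage_page.app_name is still "old-name"
```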
[jira] [Created] (SPARK-2170) Fix for global name 'PIPE' is not defined.
Grega Kespret created SPARK-2170: Summary: Fix for global name 'PIPE' is not defined. Key: SPARK-2170 URL: https://issues.apache.org/jira/browse/SPARK-2170 Project: Spark Issue Type: Bug Components: EC2 Environment: $ python --version Python 2.6.6 $ lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 6.0.9 (squeeze) Release: 6.0.9 Codename: squeeze Reporter: Grega Kespret Priority: Minor When running the spark-ec2.py script, it fails with the error NameError: global name 'PIPE' is not defined. Traceback (most recent call last): File "./spark_ec2.py", line 894, in <module> main() File "./spark_ec2.py", line 886, in main real_main() File "./spark_ec2.py", line 770, in real_main setup_cluster(conn, master_nodes, slave_nodes, opts, True) File "./spark_ec2.py", line 475, in setup_cluster dot_ssh_tar = ssh_read(master, opts, ['tar', 'c', '.ssh']) File "./spark_ec2.py", line 709, in ssh_read ssh_command(opts) + ['%s@%s' % (opts.user, host), stringify_command(command)]) File "./spark_ec2.py", line 696, in _check_output process = subprocess.Popen(stdout=PIPE, *popenargs, **kwargs) NameError: global name 'PIPE' is not defined -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2170) Fix for global name 'PIPE' is not defined.
[ https://issues.apache.org/jira/browse/SPARK-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034372#comment-14034372 ] Grega Kespret commented on SPARK-2170: -- Added a pull request that resolves this issue: https://github.com/apache/spark/pull/1109 Fix for global name 'PIPE' is not defined. -- Key: SPARK-2170 URL: https://issues.apache.org/jira/browse/SPARK-2170 Project: Spark Issue Type: Bug Components: EC2 Environment: $ python --version Python 2.6.6 $ lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 6.0.9 (squeeze) Release: 6.0.9 Codename: squeeze Reporter: Grega Kespret Priority: Minor Original Estimate: 1h Remaining Estimate: 1h When running the spark-ec2.py script, it fails with the error NameError: global name 'PIPE' is not defined. Traceback (most recent call last): File "./spark_ec2.py", line 894, in <module> main() File "./spark_ec2.py", line 886, in main real_main() File "./spark_ec2.py", line 770, in real_main setup_cluster(conn, master_nodes, slave_nodes, opts, True) File "./spark_ec2.py", line 475, in setup_cluster dot_ssh_tar = ssh_read(master, opts, ['tar', 'c', '.ssh']) File "./spark_ec2.py", line 709, in ssh_read ssh_command(opts) + ['%s@%s' % (opts.user, host), stringify_command(command)]) File "./spark_ec2.py", line 696, in _check_output process = subprocess.Popen(stdout=PIPE, *popenargs, **kwargs) NameError: global name 'PIPE' is not defined -- This message was sent by Atlassian JIRA (v6.2#6252)
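For reference, the traceback shows `Popen(stdout=PIPE, ...)` where the bare name `PIPE` was never imported; `PIPE` lives in the `subprocess` module, so either qualify it or import the name. A sketch of both forms (illustrative helpers, not the actual spark_ec2.py patch):

```python
import subprocess
import sys

def check_output_qualified(cmd):
    # Qualify the constant: subprocess.PIPE always resolves.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return out

def check_output_imported(cmd):
    # Equivalent: bind the names explicitly before using them bare.
    from subprocess import PIPE, Popen
    proc = Popen(cmd, stdout=PIPE)
    out, _ = proc.communicate()
    return out

cmd = [sys.executable, "-c", "print('ok')"]
```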
[jira] [Created] (SPARK-2171) Groovy bindings for Spark
Artur Andrzejak created SPARK-2171: -- Summary: Groovy bindings for Spark Key: SPARK-2171 URL: https://issues.apache.org/jira/browse/SPARK-2171 Project: Spark Issue Type: Improvement Components: Build, Documentation, Examples Affects Versions: 1.0.0 Reporter: Artur Andrzejak Priority: Minor A simple way to add Groovy bindings to Spark, without additional code. The idea is to use the standard Java implementations of RDD and Context, and to rely on Groovy's coercion of closures to abstract classes so that all methods which take anonymous inner classes in Java can be called with a closure. Advantages: - No need for new code, which avoids unnecessary bugs and implementation effort - Access to Spark from Groovy with the ease of closures, using the default Java implementations - No need to install additional software -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2171) Groovy bindings for Spark
[ https://issues.apache.org/jira/browse/SPARK-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artur Andrzejak updated SPARK-2171: --- Attachment: examples-Groovy4Spark.zip Groovy4Spark - Introduction.pdf A short guide for using Spark from Groovy and examples of Groovy code Groovy bindings for Spark - Key: SPARK-2171 URL: https://issues.apache.org/jira/browse/SPARK-2171 Project: Spark Issue Type: Improvement Components: Build, Documentation, Examples Affects Versions: 1.0.0 Reporter: Artur Andrzejak Priority: Minor Attachments: Groovy4Spark - Introduction.pdf, examples-Groovy4Spark.zip A simple way to add Groovy bindings to Spark, without additional code. The idea is to use the standard java implementations of RDD and Context and to use the coercion of Groovy closure to abstract classes to call all methods, which take anonymous inner classes in Java, with a closure. Advantages: - No need for new code, which avoids unnecessary bugs and implementation effort - Access to spark from Groovy with the ease of closures using the default Java implementations - No need to install additional software -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-1990) spark-ec2 should only need Python 2.6, not 2.7
[ https://issues.apache.org/jira/browse/SPARK-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-1990: -- In your PR, you should use subprocess.PIPE instead of calling PIPE directly. Could you submit a patch for it? spark-ec2 should only need Python 2.6, not 2.7 -- Key: SPARK-1990 URL: https://issues.apache.org/jira/browse/SPARK-1990 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Anant Daksh Asthana Labels: Starter Fix For: 0.9.2, 1.0.1, 1.1.0 There were some posts on the lists that spark-ec2 does not work with Python 2.6. In addition, we should check the Python version at the top of the script and exit if it's too old. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1990) spark-ec2 should only need Python 2.6, not 2.7
[ https://issues.apache.org/jira/browse/SPARK-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1990. -- Resolution: Fixed HOTFIX: https://github.com/apache/spark/pull/1108 spark-ec2 should only need Python 2.6, not 2.7 -- Key: SPARK-1990 URL: https://issues.apache.org/jira/browse/SPARK-1990 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Anant Daksh Asthana Labels: Starter Fix For: 0.9.2, 1.0.1, 1.1.0 There were some posts on the lists that spark-ec2 does not work with Python 2.6. In addition, we should check the Python version at the top of the script and exit if it's too old. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2168) History Server rendered page not suitable for load balancing
[ https://issues.apache.org/jira/browse/SPARK-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lukasz Jastrzebski updated SPARK-2168: -- Description: Small issue, but still: I run the history server through Marathon and balance it through haproxy. The problem is that the links generated by HistoryPage (links to completed applications) are absolute, e.g. <a href="http://some-server:port/history/...">completedApplicationName</a>, but instead they should be relative, e.g. <a href="/history/...">completedApplicationName</a>, so they can be load balanced. was: Small issue, but still: I run the history server through Marathon and balance it through haproxy. The problem is that the links generated by HistoryPage (links to completed applications) are absolute, e.g. http://some-server:port/history..., but instead they should be relative, just /history..., so they can be load balanced. History Server rendered page not suitable for load balancing --- Key: SPARK-2168 URL: https://issues.apache.org/jira/browse/SPARK-2168 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Lukasz Jastrzebski Priority: Minor Small issue, but still: I run the history server through Marathon and balance it through haproxy. The problem is that the links generated by HistoryPage (links to completed applications) are absolute, e.g. <a href="http://some-server:port/history/...">completedApplicationName</a>, but instead they should be relative, e.g. <a href="/history/...">completedApplicationName</a>, so they can be load balanced. -- This message was sent by Atlassian JIRA (v6.2#6252)
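The distinction can be checked with the standard library: a root-relative href is resolved against whichever host served the page (here a hypothetical haproxy frontend), while an absolute href always escapes back to the backend host that rendered the page.

```python
from urllib.parse import urljoin

# Hypothetical hosts: the proxy that users hit, and a backend behind it.
page_url = "http://haproxy.example/history"

absolute_href = "http://some-server:8080/history/app-1"  # what HistoryPage emits today
relative_href = "/history/app-1"                         # what the report asks for

# The absolute link bypasses the proxy entirely:
assert urljoin(page_url, absolute_href) == "http://some-server:8080/history/app-1"
# The relative link stays on whatever host served the page:
assert urljoin(page_url, relative_href) == "http://haproxy.example/history/app-1"
```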
[jira] [Created] (SPARK-2172) PySpark cannot import mllib modules in YARN-client mode
Vlad Frolov created SPARK-2172: -- Summary: PySpark cannot import mllib modules in YARN-client mode Key: SPARK-2172 URL: https://issues.apache.org/jira/browse/SPARK-2172 Project: Spark Issue Type: Bug Components: MLlib, PySpark, Spark Core, YARN Affects Versions: 1.0.0, 1.1.0 Environment: Ubuntu 14.04 Java 7 Python 2.7 CDH 5.0.2 (Hadoop 2.3.0): HDFS, YARN Spark 1.0.0 and git master Reporter: Vlad Frolov Here is the simple reproduce code: {code:title=issue.py|borderStyle=solid} from pyspark.mllib.regression import LabeledPoint sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).count() {code} Note: The same issue occurs with .collect() instead of .count() {code:title=TraceBack|borderStyle=solid} Py4JJavaError: An error occurred while calling o110.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 8.0:0 failed 4 times, most recent failure: Exception failure in TID 52 on host ares: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /mnt/storage/bigisle/yarn/1/yarn/local/usercache/blb/filecache/18/spark-assembly-1.0.0-hadoop2.2.0.jar/pyspark/worker.py, line 73, in main command = pickleSer._read_with_length(infile) File /mnt/storage/bigisle/yarn/1/yarn/local/usercache/blb/filecache/18/spark-assembly-1.0.0-hadoop2.2.0.jar/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) ImportError: No module named mllib.regression org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} However, this code works as expected: {code:title=noissue.py|borderStyle=solid} from pyspark.mllib.regression import 
LabeledPoint sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).first() sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).take(3) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2172) PySpark cannot import mllib modules in YARN-client mode
[ https://issues.apache.org/jira/browse/SPARK-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vlad Frolov updated SPARK-2172: --- Description: Here is the simple reproduce code: {noformat} $ HADOOP_CONF_DIR=/etc/hadoop/conf MASTER=yarn-client ./bin/pyspark {noformat} {code:title=issue.py|borderStyle=solid} from pyspark.mllib.regression import LabeledPoint sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).count() {code} Note: The same issue occurs with .collect() instead of .count() {code:title=TraceBack|borderStyle=solid} Py4JJavaError: An error occurred while calling o110.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 8.0:0 failed 4 times, most recent failure: Exception failure in TID 52 on host ares: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /mnt/storage/bigisle/yarn/1/yarn/local/usercache/blb/filecache/18/spark-assembly-1.0.0-hadoop2.2.0.jar/pyspark/worker.py, line 73, in main command = pickleSer._read_with_length(infile) File /mnt/storage/bigisle/yarn/1/yarn/local/usercache/blb/filecache/18/spark-assembly-1.0.0-hadoop2.2.0.jar/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) ImportError: No module named mllib.regression org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) 
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} However, this code works as expected: {code:title=noissue.py|borderStyle=solid} from pyspark.mllib.regression import LabeledPoint sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).first() sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).take(3) {code} was: Here is the simple reproduce code: 
{code:title=issue.py|borderStyle=solid} from pyspark.mllib.regression import LabeledPoint sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).count() {code} Note: The same issue occurs with .collect() instead of .count() {code:title=TraceBack|borderStyle=solid} Py4JJavaError: An error occurred while calling o110.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 8.0:0 failed 4 times, most recent failure: Exception failure in TID 52 on host ares: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File
[jira] [Created] (SPARK-2173) Add Master Computer and SuperStep Accumulator to Pregel GraphX Implement
Ted Malaska created SPARK-2173: -- Summary: Add Master Computer and SuperStep Accumulator to Pregel GraphX Implement Key: SPARK-2173 URL: https://issues.apache.org/jira/browse/SPARK-2173 Project: Spark Issue Type: Improvement Reporter: Ted Malaska In Giraph there is the idea of a master compute and a global superstep value you can access. I would like to add that to GraphX. Let me know what you think. I will try to get a pull request tonight. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2173) Add Master Computer and SuperStep Accumulator to Pregel GraphX Implemention
[ https://issues.apache.org/jira/browse/SPARK-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Malaska updated SPARK-2173: --- Summary: Add Master Computer and SuperStep Accumulator to Pregel GraphX Implemention (was: Add Master Computer and SuperStep Accumulator to Pregel GraphX Implement) Add Master Computer and SuperStep Accumulator to Pregel GraphX Implemention --- Key: SPARK-2173 URL: https://issues.apache.org/jira/browse/SPARK-2173 Project: Spark Issue Type: Improvement Reporter: Ted Malaska In Giraph there is the idea of a master compute and a global superstep value you can access. I would like to add that to GraphX. Let me know what you think. I will try to get a pull request tonight. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2174) Implement treeReduce and treeAggregate
Xiangrui Meng created SPARK-2174: Summary: Implement treeReduce and treeAggregate Key: SPARK-2174 URL: https://issues.apache.org/jira/browse/SPARK-2174 Project: Spark Issue Type: New Feature Components: MLlib, Spark Core Reporter: Xiangrui Meng Assignee: Xiangrui Meng In `reduce` and `aggregate`, the driver node spends time linear in the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big. SPARK-1485 tracks the progress of implementing AllReduce on Spark. I did several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be the right way to go for Spark. Using a binary tree may introduce some overhead in communication, because the driver still needs to coordinate the data shuffling. In my experiments, n - sqrt(n) - 1 gives the best performance in general. But it certainly needs more testing. -- This message was sent by Atlassian JIRA (v6.2#6252)
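The idea can be sketched in a few lines (illustrative only, not MLlib's implementation): instead of the driver folding all partition results itself, combine them level by level so that no single step — the driver's final fold included — touches more than about sqrt(n) values, matching the n - sqrt(n) - 1 schedule mentioned above.

```python
import math
from functools import reduce

def tree_reduce(partition_results, f):
    """Reduce in levels of fan-in ~sqrt(n) instead of one fold over n values."""
    vals = list(partition_results)
    fan_in = max(2, int(math.sqrt(len(vals))))
    while len(vals) > 1:
        # Each level combines groups of `fan_in` values; in Spark each level
        # would itself run as a distributed stage, leaving the driver only
        # the small final fold instead of a fold over every partition.
        vals = [reduce(f, vals[i:i + fan_in])
                for i in range(0, len(vals), fan_in)]
    return vals[0]

# 100 "partition results": two levels of fan-in 10, rather than one fold of 100.
total = tree_reduce(range(1, 101), lambda a, b: a + b)  # -> 5050
```

For an associative operation the grouping does not change the result, only how the combining work is spread out.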
[jira] [Updated] (SPARK-2174) Implement treeReduce and treeAggregate
[ https://issues.apache.org/jira/browse/SPARK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2174: - Description: In `reduce` and `aggregate`, the driver node spends linear time on the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big. SPARK-1485 tracks the progress of implementing AllReduce on Spark. I did several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be right way to go for Spark. Using binary tree may introduce some overhead in communication, because the driver still need to coordinate on data shuffling. In my experiments, n - sqrt(n) - 1 gives the best performance in general. But it certainly needs more testing. was: In `reduce` and `aggregate`, the driver node spends linear time on the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big. SPARK-1485 tracks the progress of implementing AllReduce on Spark. I didn't several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be right way to go for Spark. Using binary tree may introduce some overhead in communication, because the driver still need to coordinate on data shuffling. In my experiments, n - sqrt(n) - 1 gives the best performance in general. But it certainly needs more testing. Implement treeReduce and treeAggregate -- Key: SPARK-2174 URL: https://issues.apache.org/jira/browse/SPARK-2174 Project: Spark Issue Type: New Feature Components: MLlib, Spark Core Reporter: Xiangrui Meng Assignee: Xiangrui Meng In `reduce` and `aggregate`, the driver node spends linear time on the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big. SPARK-1485 tracks the progress of implementing AllReduce on Spark. 
I did several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be the right way to go for Spark. Using a binary tree may introduce some overhead in communication, because the driver still needs to coordinate the data shuffling. In my experiments, n - sqrt(n) - 1 gives the best performance in general. But it certainly needs more testing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2170) Fix for global name 'PIPE' is not defined.
[ https://issues.apache.org/jira/browse/SPARK-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2170. Resolution: Not a Problem This was fixed already in a hotfix. But thanks for submitting the patch! https://github.com/apache/spark/pull/1108 Fix for global name 'PIPE' is not defined. -- Key: SPARK-2170 URL: https://issues.apache.org/jira/browse/SPARK-2170 Project: Spark Issue Type: Bug Components: EC2 Environment: $ python --version Python 2.6.6 $ lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 6.0.9 (squeeze) Release: 6.0.9 Codename: squeeze Reporter: Grega Kespret Priority: Minor Original Estimate: 1h Remaining Estimate: 1h When running the spark-ec2.py script, it fails with the error NameError: global name 'PIPE' is not defined. Traceback (most recent call last): File "./spark_ec2.py", line 894, in <module> main() File "./spark_ec2.py", line 886, in main real_main() File "./spark_ec2.py", line 770, in real_main setup_cluster(conn, master_nodes, slave_nodes, opts, True) File "./spark_ec2.py", line 475, in setup_cluster dot_ssh_tar = ssh_read(master, opts, ['tar', 'c', '.ssh']) File "./spark_ec2.py", line 709, in ssh_read ssh_command(opts) + ['%s@%s' % (opts.user, host), stringify_command(command)]) File "./spark_ec2.py", line 696, in _check_output process = subprocess.Popen(stdout=PIPE, *popenargs, **kwargs) NameError: global name 'PIPE' is not defined -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2060) Querying JSON Datasets with SQL and DSL in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2060. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Querying JSON Datasets with SQL and DSL in Spark SQL Key: SPARK-2060 URL: https://issues.apache.org/jira/browse/SPARK-2060 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.0.1, 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2175) Null values when using App trait.
Brandon Amos created SPARK-2175: --- Summary: Null values when using App trait. Key: SPARK-2175 URL: https://issues.apache.org/jira/browse/SPARK-2175 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Linux Reporter: Brandon Amos Priority: Trivial See http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerExceptions-when-using-val-or-broadcast-on-a-standalone-cluster-tc7524.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2172) PySpark cannot import mllib modules in YARN-client mode
[ https://issues.apache.org/jira/browse/SPARK-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034730#comment-14034730 ] Vlad Frolov commented on SPARK-2172: I've tried to run the code in standalone and local modes. There is no such error, but I want to exercise YARN. I've also tried to run similar code in spark-shell (Scala) and it works fine: {code}
scala> import org.apache.spark.mllib.regression.LabeledPoint
scala> import org.apache.spark.mllib.linalg.{Vector, Vectors}
scala> val array: Array[Double] = Array(1, 2)
scala> val vector: Vector = Vectors.dense(array)
scala> sc.parallelize(1 to 3).map(x => LabeledPoint(x, vector)).collect()
res2: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array(LabeledPoint(1.0, [1.0,2.0]), LabeledPoint(2.0, [1.0,2.0]), LabeledPoint(3.0, [1.0,2.0]))
{code} PySpark cannot import mllib modules in YARN-client mode --- Key: SPARK-2172 URL: https://issues.apache.org/jira/browse/SPARK-2172 Project: Spark Issue Type: Bug Components: MLlib, PySpark, Spark Core, YARN Affects Versions: 1.0.0, 1.1.0 Environment: Ubuntu 14.04 Java 7 Python 2.7 CDH 5.0.2 (Hadoop 2.3.0): HDFS, YARN Spark 1.0.0 and git master Reporter: Vlad Frolov Labels: mllib, python Here is the simple reproduce code: {noformat} $ HADOOP_CONF_DIR=/etc/hadoop/conf MASTER=yarn-client ./bin/pyspark {noformat} {code:title=issue.py|borderStyle=solid} from pyspark.mllib.regression import LabeledPoint sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).count() {code} Note: The same issue occurs with .collect() instead of .count() {code:title=TraceBack|borderStyle=solid} Py4JJavaError: An error occurred while calling o110.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8.0:0 failed 4 times, most recent failure: Exception failure in TID 52 on host ares: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /mnt/storage/bigisle/yarn/1/yarn/local/usercache/blb/filecache/18/spark-assembly-1.0.0-hadoop2.2.0.jar/pyspark/worker.py, line 73, in main command = pickleSer._read_with_length(infile) File /mnt/storage/bigisle/yarn/1/yarn/local/usercache/blb/filecache/18/spark-assembly-1.0.0-hadoop2.2.0.jar/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) ImportError: No module named mllib.regression org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at
[jira] [Reopened] (SPARK-2038) Don't shadow conf variable in saveAsHadoop functions
[ https://issues.apache.org/jira/browse/SPARK-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-2038: Don't shadow conf variable in saveAsHadoop functions -- Key: SPARK-2038 URL: https://issues.apache.org/jira/browse/SPARK-2038 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Nan Zhu Priority: Minor Fix For: 1.1.0 This could lead to a lot of bugs. We should just change it to hadoopConf. I noticed this when reviewing SPARK-1677. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2038) Don't shadow conf variable in saveAsHadoop functions
[ https://issues.apache.org/jira/browse/SPARK-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2038. Resolution: Won't Fix
[jira] [Commented] (SPARK-2038) Don't shadow conf variable in saveAsHadoop functions
[ https://issues.apache.org/jira/browse/SPARK-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034735#comment-14034735 ] Patrick Wendell commented on SPARK-2038: Unfortunately, after discussion with Reynold, I realized we have to revert this. The issue is that we can't change parameter names in public APIs, because Scala allows callers to pass arguments by name: http://docs.scala-lang.org/tutorials/tour/named-parameters.html So a change like this could break source compatibility for users.
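The compatibility hazard Patrick describes is easiest to see with named arguments. A minimal Python analogue of the same break (the `save_as_hadoop_file` functions and their parameters here are hypothetical, purely for illustration — Scala's named parameters behave the same way):

```python
# Hypothetical v1 API: callers are free to pass `conf` by name.
def save_as_hadoop_file(path, conf=None):
    return (path, conf)

call_v1 = save_as_hadoop_file("out", conf={"k": "v"})  # works

# Hypothetical v2 renames the parameter to `hadoop_conf`. Positional
# callers are unaffected, but the named-argument call above now breaks.
def save_as_hadoop_file_v2(path, hadoop_conf=None):
    return (path, hadoop_conf)

try:
    save_as_hadoop_file_v2("out", conf={"k": "v"})
    rename_breaks_callers = False
except TypeError:  # unexpected keyword argument 'conf'
    rename_breaks_callers = True
```

This is why renaming `conf` to `hadoopConf` in the public `saveAsHadoop` signatures is source-incompatible even though it changes no behavior.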
[jira] [Commented] (SPARK-2038) Don't shadow conf variable in saveAsHadoop functions
[ https://issues.apache.org/jira/browse/SPARK-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034739#comment-14034739 ] Nan Zhu commented on SPARK-2038: Ah, I see; that's fine...
[jira] [Commented] (SPARK-791) [pyspark] operator.getattr not serialized
[ https://issues.apache.org/jira/browse/SPARK-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034779#comment-14034779 ] Mark Baker commented on SPARK-791: -- I began porting PySpark to Python 3, but with my modest Python-fu, hit a wall at cloudpickle. Dill supports Python 3, so it seems like a big win in that direction too. [pyspark] operator.getattr not serialized - Key: SPARK-791 URL: https://issues.apache.org/jira/browse/SPARK-791 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.7.2, 0.9.0 Reporter: Jim Blomo Priority: Minor
Using operator.itemgetter as a function in map seems to confuse the serialization process in pyspark. I'm using itemgetter to return tuples, which fails with a TypeError (details below). Using an equivalent lambda function returns the correct result. Use a test file:
{code:sh}
echo 1,1 > test.txt
{code}
Then try mapping it to a tuple:
{code:python}
import csv
sc.textFile("test.txt").mapPartitions(csv.reader).map(lambda l: (l[0], l[1])).first()
# Out[7]: ('1', '1')
{code}
But this does not work when using operator.itemgetter:
{code:python}
import operator
sc.textFile("test.txt").mapPartitions(csv.reader).map(operator.itemgetter(0, 1)).first()
# TypeError: list indices must be integers, not tuple
{code}
This is running with git master, commit 6d60fe571a405eb9306a2be1817901316a46f892, IPython 0.13.2, java version 1.7.0_25, Scala code runner version 2.9.1, Ubuntu 12.04. Full debug output:
{code:python}
In [9]: sc.textFile("test.txt").mapPartitions(csv.reader).map(operator.itemgetter(0,1)).first()
13/07/04 16:19:49 INFO storage.MemoryStore: ensureFreeSpace(33632) called with curMem=201792, maxMem=339585269
13/07/04 16:19:49 INFO storage.MemoryStore: Block broadcast_6 stored as values to memory (estimated size 32.8 KB, free 323.6 MB)
13/07/04 16:19:49 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/04 16:19:49 INFO spark.SparkContext: Starting job: takePartition at
NativeMethodAccessorImpl.java:-2
13/07/04 16:19:49 INFO scheduler.DAGScheduler: Got job 4 (takePartition at NativeMethodAccessorImpl.java:-2) with 1 output partitions (allowLocal=true)
13/07/04 16:19:49 INFO scheduler.DAGScheduler: Final stage: Stage 4 (PythonRDD at NativeConstructorAccessorImpl.java:-2)
13/07/04 16:19:49 INFO scheduler.DAGScheduler: Parents of final stage: List()
13/07/04 16:19:49 INFO scheduler.DAGScheduler: Missing parents: List()
13/07/04 16:19:49 INFO scheduler.DAGScheduler: Computing the requested partition locally
13/07/04 16:19:49 INFO scheduler.DAGScheduler: Failed to run takePartition at NativeMethodAccessorImpl.java:-2
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-9-1fdb3e7a8ac7> in <module>()
----> 1 sc.textFile("test.txt").mapPartitions(csv.reader).map(operator.itemgetter(0,1)).first()

/home/jim/src/spark/python/pyspark/rdd.pyc in first(self)
    389         2
    390         """
--> 391         return self.take(1)[0]
    392
    393     def saveAsTextFile(self, path):

/home/jim/src/spark/python/pyspark/rdd.pyc in take(self, num)
    372         items = []
    373         for partition in range(self._jrdd.splits().size()):
--> 374             iterator = self.ctx._takePartition(self._jrdd.rdd(), partition)
    375             # Each item in the iterator is a string, Python object, batch of
    376             # Python objects. Regardless, it is sufficient to take `num`

/home/jim/src/spark/python/lib/py4j0.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
    498         answer = self.gateway_client.send_command(command)
    499         return_value = get_return_value(answer, self.gateway_client,
--> 500                 self.target_id, self.name)
    501
    502         for temp_arg in temp_args:

/home/jim/src/spark/python/lib/py4j0.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    298             raise Py4JJavaError(
    299                 'An error occurred while calling {0}{1}{2}.\n'.
--> 300                 format(target_id, '.', name), value)
    301         else:
    302             raise Py4JError(

Py4JJavaError: An error occurred while calling z:spark.api.python.PythonRDD.takePartition.
: spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/jim/src/spark/python/pyspark/worker.py", line 53, in main
    for obj in func(split_index, iterator):
  File "/home/jim/src/spark/python/pyspark/serializers.py", line 24, in batched
    for item in iterator:
TypeError: list indices must be integers, not tuple at
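For reference, `operator.itemgetter(0, 1)` does return the expected tuple when called locally, and is equivalent to the lambda workaround from the description; the failure above only appears after the function has gone through PySpark's serialization of the mapped function:

```python
import operator

row = ['1', '1']

# itemgetter with two indices builds a callable that returns a tuple...
getter = operator.itemgetter(0, 1)

# ...behaving exactly like the lambda workaround that does serialize correctly:
workaround = lambda l: (l[0], l[1])

print(getter(row))      # ('1', '1')
print(workaround(row))  # ('1', '1')
```

So the bug is in the pickling of the `itemgetter` object, not in `itemgetter` itself.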
[jira] [Commented] (SPARK-2173) Add Master Computer and SuperStep Accumulator to Pregel GraphX Implemention
[ https://issues.apache.org/jira/browse/SPARK-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034798#comment-14034798 ] Ted Malaska commented on SPARK-2173: Nope, a broadcast won't work either. Let me think about it overnight. Maybe the solution is simply to update the VertexRDD.innerJoin method to take i, which is the superstep. Add Master Computer and SuperStep Accumulator to Pregel GraphX Implemention --- Key: SPARK-2173 URL: https://issues.apache.org/jira/browse/SPARK-2173 Project: Spark Issue Type: Improvement Reporter: Ted Malaska In Giraph there is an idea of a master compute and a global superstep value you can access. I would like to add that to GraphX. Let me know what you think. I will try to get a pull request tonight.
[jira] [Commented] (SPARK-2173) Add Master Computer and SuperStep Accumulator to Pregel GraphX Implemention
[ https://issues.apache.org/jira/browse/SPARK-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034816#comment-14034816 ] Ted Malaska commented on SPARK-2173: Sorry, I'm slow; I just realized I don't even need superStep or the master computer. In the confines of Giraph I did need them to solve tree rooting, but in the world of GraphX I can enter and exit Pregel whenever and however often I want. So in the case of tree rooting, I would do at most one superstep of Pregel to broadcast to all my children to identify my roots, then start a second Pregel with unbounded supersteps to root all the other vertices to the roots. GraphX is so freeing in comparison to Giraph. I will close the ticket now.
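The two-phase scheme above can be sketched outside GraphX. A plain-Python stand-in (not GraphX's Pregel API; the `parent` map representing the forest is hypothetical) of one bounded step to find roots, then unbounded steps until every vertex is rooted:

```python
def root_vertices(parent):
    """parent maps each non-root vertex to its parent vertex."""
    vertices = set(parent) | set(parent.values())
    # Phase 1: a single bounded "superstep" -- vertices with no parent are roots.
    root = {v: v for v in vertices if v not in parent}
    # Phase 2: unbounded "supersteps" -- propagate root ids until a fixpoint.
    changed = True
    while changed:
        changed = False
        for child, p in parent.items():
            if child not in root and p in root:
                root[child] = root[p]
                changed = True
    return root

# Two trees, rooted at 'a' and 'd':
result = root_vertices({'b': 'a', 'c': 'b', 'e': 'd'})
print(sorted(result.items()))
# [('a', 'a'), ('b', 'a'), ('c', 'a'), ('d', 'd'), ('e', 'd')]
```

In GraphX terms, phase 1 is a Pregel run with maxIterations of one and phase 2 an unbounded run; the point of the comment is that nothing inside Pregel needs to know which phase it is in.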
[jira] [Closed] (SPARK-2173) Add Master Computer and SuperStep Accumulator to Pregel GraphX Implemention
[ https://issues.apache.org/jira/browse/SPARK-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Malaska closed SPARK-2173. -- Resolution: Invalid Not an issue. GraphX doesn't need these features because it is not as limiting as Giraph in its options.