[ https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Davidson reopened SPARK-11509:
-------------------------------------

My issue is not resolved.

I am able to use IPython notebooks on my local Mac but still cannot run them on my cluster.

I have tried launching the same way I do on my Mac:

export PYSPARK_PYTHON=python2.7
export PYSPARK_DRIVER_PYTHON=python2.7
IPYTHON_OPTS="notebook --no-browser --port=7000"  $SPARK_ROOT/bin/pyspark 
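
For reference, the 1.5.1 programming guide (quoted at the bottom of this issue) uses PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS rather than IPYTHON_OPTS; the equivalent launch would look roughly like this (just a sketch of that form, not something I am claiming behaves any differently on the cluster):

export PYSPARK_PYTHON=python2.7
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000"
$SPARK_ROOT/bin/pyspark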

I also tried setting 
export PYSPARK_PYTHON=python2.7

in /root/spark/conf/spark-env.sh on all my machines.
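
Concretely, the change and one way of pushing it out to the workers would look roughly like this (a sketch; the paths assume the standard spark-ec2 layout under /root/spark, and the loop over conf/slaves is just an illustration):

# make the PySpark workers use python2.7 (run on the master)
echo 'export PYSPARK_PYTHON=python2.7' >> /root/spark/conf/spark-env.sh

# copy the same spark-env.sh to every slave listed in conf/slaves
while read slave; do
  scp /root/spark/conf/spark-env.sh "root@$slave:/root/spark/conf/"
done < /root/spark/conf/slaves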

The following code example

from pyspark import SparkContext
sc = SparkContext("local", "Simple App")  # strange, we should not have to create sc
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
textFile.take(3)

generates the following error message:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-1-e0006b323300> in <module>()
      2 sc = SparkContext("local", "Simple App")
      3 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
----> 4 textFile.take(3)

/root/spark/python/pyspark/rdd.py in take(self, num)
   1297 
   1298             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1299             res = self.context.runJob(self, takeUpToNumLeft, p)
   1300 
   1301             items += res

/root/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
    914         # SparkContext#runJob.
    915         mappedRDD = rdd.mapPartitions(partitionFunc)
--> 916         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
    917         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
    918 

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
        at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
        at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more
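
For what it is worth, the kind of check I can run to see which interpreter each side actually picks up looks roughly like this (a sketch; it assumes the spark-ec2 layout, i.e. that /root/spark/conf/slaves lists the workers and that PYSPARK_PYTHON is set in /root/spark/conf/spark-env.sh):

# on the master, where the notebook/driver runs
which python python2.7
python -V 2>&1; python2.7 -V 2>&1
env | grep PYSPARK

# on each worker (-n keeps ssh from consuming the slaves list on stdin)
while read slave; do
  ssh -n "root@$slave" 'hostname; python -V 2>&1; python2.7 -V 2>&1; grep PYSPARK /root/spark/conf/spark-env.sh'
done < /root/spark/conf/slaves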

Any suggestions?

Kind regards

Andy

> ipython notebooks do not work on clusters created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11509
>                 URL: https://issues.apache.org/jira/browse/SPARK-11509
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, EC2, PySpark
>    Affects Versions: 1.5.1
>         Environment: AWS cluster
> [ec2-user@ip-172-31-29-60 ~]$ uname -a
> Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 
> SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Andrew Davidson
>
> I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local Mac.
> I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an AWS cluster. I am able to run the Java SparkPi example on the cluster, however I am not able to run IPython notebooks on the cluster. (I connect using an SSH tunnel.)
> According to the 1.5.1 getting started doc 
> http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
> The following should work
>  PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook 
> --no-browser --port=7000" /root/spark/bin/pyspark
> I am able to connect to the notebook server and start a notebook, however:
> Bug 1) the default SparkContext (sc) does not exist
> from pyspark import SparkContext
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
> ---------------------------------------------------------------------------
> NameError                                 Traceback (most recent call last)
> <ipython-input-1-127b6a58d5cc> in <module>()
>       1 from pyspark import SparkContext
> ----> 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
>       3 textFile.take(3)
> NameError: name 'sc' is not defined
> Bug 2) If I create a SparkContext I get the following Python version mismatch error:
> sc = SparkContext("local", "Simple App")
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
>  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
>     ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 
> 2.6, PySpark cannot run with different minor versions
> I am able to run IPython notebooks on my local Mac as follows. (By default you would get an error that the driver and workers are using different versions of Python.)
> $ cat ~/bin/pySparkNotebook.sh
> #!/bin/sh 
> set -x # turn debugging on
> #set +x # turn debugging off
> export PYSPARK_PYTHON=python3
> export PYSPARK_DRIVER_PYTHON=python3
> IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*
> I have spent a lot of time trying to debug the pyspark script, however I cannot figure out what the problem is.
> Please let me know if there is something I can do to help.
> Andy


