[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example

2014-09-24 Thread Evan Samanas (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146639#comment-14146639 ]

Evan Samanas commented on SPARK-3662:
-------------------------------------

I wouldn't focus on the example itself, the fact that I modified it, or whether I 
should be importing only a small portion of pandas.  The issue here is that Spark 
breaks in this case because of a name collision.  Modifying the example is simply 
the one reproducer I've found.

I was modifying the example to learn how Spark ships Python code to the 
cluster.  In this case, I expected pandas to be imported only in the driver 
program and not by any workers.  The workers do not have pandas installed, so 
the expected behavior is that the example runs to completion; an ImportError 
would mean that the workers are importing modules they don't need for the task 
at hand.

The way I expected Spark to work IS actually how Spark works: modules are only 
imported by workers if a function passed to them uses those modules.  But this 
error showed me false evidence to the contrary.  I'm assuming the bug is in 
Spark's modifications to CloudPickle, not in the way the example is set up.


[jira] [Created] (SPARK-3662) Importing pandas breaks included pi.py example

2014-09-23 Thread Evan Samanas (JIRA)
Evan Samanas created SPARK-3662:
---

 Summary: Importing pandas breaks included pi.py example
 Key: SPARK-3662
 URL: https://issues.apache.org/jira/browse/SPARK-3662
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.1.0
 Environment: Xubuntu 14.04.  Yarn cluster running on Ubuntu 12.04.
Reporter: Evan Samanas


If I add "import pandas" at the top of the included pi.py example and submit 
using "spark-submit --master yarn-client", I get this stack trace:

{code}
Traceback (most recent call last):
  File "/home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py", line 
39, in <module>
count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
  File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 759, in reduce
vals = self.mapPartitions(func).collect()
  File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 723, in collect
bytesInJava = self._jrdd.collect().iterator()
  File 
"/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
line 300, in get_return_value
py4j.protocol.Py4JJavaError14/09/23 15:51:58 INFO TaskSetManager: Lost task 2.3 
in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: 
org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File 
"/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py",
 line 75, in main
command = pickleSer._read_with_length(infile)
  File 
"/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
 line 150, in _read_with_length
return self.loads(obj)
ImportError: No module named algos

{code}

The example works fine if I move the statement "from random import random" from 
the top of the file into the function (def f(_)) defined in the example.  As near 
as I can tell, "random" is getting confused with a function of the same name 
within pandas.algos.
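The workaround described above can be sketched as follows (adapted from the example's sampling function; the surrounding pi.py driver code is assumed). Importing `random` locally keeps the driver-level name out of the pickled function's globals, sidestepping the collision:

```python
def f(_):
    # Local import: the name `random` is resolved inside the function,
    # so it cannot be confused with pandas.algos.random at unpickle time.
    from random import random
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

# Plain Monte Carlo estimate without Spark, just to show f is unchanged:
n = 100000
count = sum(f(i) for i in range(n))
print(4.0 * count / n)  # prints an estimate near 3.14
```

The trade-off is a small per-call cost for the import lookup on each invocation, which is negligible compared to avoiding the ImportError on the workers.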


Submitting the same script using --master local works, but gives a distressing 
amount of random characters to stdout or stderr and messes up my terminal:

{code}
...
@J@J@J@J@J@J@J@J@J@J@J@J@J@JJ@J@J@J@J 
@J!@J"@J#@J$@J%@J&@J'@J(@J)@J*@J+@J,@J-@J.@J/@J0@J1@J2@J3@J4@J5@J6@J7@J8@J9@J:@J;@J<@J=@J>@J?@J@@JA@JB@JC@JD@JE@JF@JG@JH@JI@JJ@JK@JL@JM@JN@JO@JP@JQ@JR@JS@JT@JU@JV@JW@JX@JY@JZ@J[@J\@J]@J^@J_@J`@Ja@Jb@Jc@Jd@Je@Jf@Jg@Jh@Ji@Jj@Jk@Jl@Jm@Jn@Jo@Jp@Jq@Jr@Js@Jt@Ju@Jv@Jw@Jx@Jy@Jz@J{@J|@J}@J~@J@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JJJ�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JAJAJAJAJAJAJAJAAJ
   AJ
AJ
  AJ
AJAJAJAJAJAJAJAJAJAJAJAJAJAJJAJAJAJAJ 
AJ!AJ"AJ#AJ$AJ%AJ&AJ'AJ(AJ)AJ*AJ+AJ,AJ-AJ.AJ/AJ0AJ1AJ2AJ3AJ4AJ5AJ6AJ7AJ8AJ9AJ:AJ;AJAJ?AJ@AJAAJBAJCAJDAJEAJFAJGAJHAJIAJJAJKAJLAJMAJNAJOAJPAJQAJRAJSAJTAJUAJVAJWAJXAJYAJZAJ[AJ\AJ]AJ^AJ_AJ`AJaAJbAJcAJdAJeAJfAJgAJhAJiAJjAJkAJlAJmAJnAJoAJpAJqAJrAJsAJtAJuAJvAJwAJxAJyAJzAJ{AJ|AJ}AJ~AJAJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJJJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�A14/09/23
 15:42:09 INFO SparkContext: Job finished: reduce at 
/home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 
11.276879779 s
J�AJ�AJ�AJ�AJ�AJ�AJ�A�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJBJBJBJBJBJBJBJBBJ
 BJ
BJ
  BJ
BJBJBJBJBJBJBJBJBJBJBJBJBJBJJBJBJBJBJ 
BJ!BJ"BJ#BJ$BJ%BJ&BJ'BJ(BJ)BJ*BJ+BJ,BJ-BJ.BJ/BJ0BJ1BJ2BJ3BJ4BJ5BJ6BJ7BJ8BJ9BJ:BJ;BJBJ?BJ@Be.
�]qJ#1a.
�]qJX4a.
�]qJX4a.
�]qJ#1a.
�]qJX4a.
�]qJX4a.
�]qJ#1a.
�]qJX4a.
�]qJX4a.
�]qJa.
Pi is roughly 3.146136
{code}

No idea if that's related, but thought I'd include it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org