Re: collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread Davies Liu
Is it possible that you have Python 2.7 on the driver, but Python 2.6
on the workers?

PySpark requires the same minor version of Python on both the driver and
the workers. In PySpark 1.4+, this check is performed before any tasks are
run.
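
A quick way to check is to compare the version on the driver with what a
trivial job reports back from the executors. This is just a sketch (any
small RDD will do; if the mismatch is severe the job itself may fail, which
also answers the question):

    import sys

    # Python minor version seen by the driver process
    driver_py = sys.version_info[:2]

    # Python minor version(s) seen by the executor processes
    worker_pys = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
                    .map(lambda _: sys.version_info[:2])
                    .distinct()
                    .collect())

    print("driver %s, workers %s" % (driver_py, worker_pys))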

On Mon, Aug 10, 2015 at 2:53 PM, YaoPau  wrote:
> I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via
> iPython Notebook.  I'm getting collect() to work just fine, but take()
> errors.  (I'm having issues with collect() on other datasets ... but take()
> seems to break every time I run it.)
>
> My code is below.  Any thoughts?
>
>>> sc
> 
>>> sys.version
> '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC
> 4.4.7 20120313 (Red Hat 4.4.7-1)]'
>>> hourly = sc.textFile('tester')
>>> hourly.collect()
> [u'a man',
>  u'a plan',
>  u'a canal',
>  u'panama']
>>> hourly = sc.textFile('tester')
>>> hourly.take(2)
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>   1 hourly = sc.textFile('tester')
> > 2 hourly.take(2)
>
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py in take(self, num)
>1223
>1224 p = range(partsScanned, min(partsScanned +
> numPartsToTry, totalParts))
> -> 1225 res = self.context.runJob(self, takeUpToNumLeft, p,
> True)
>1226
>1227 items += res
>
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py in
> runJob(self, rdd, partitionFunc, partitions, allowLocal)
> 841 # SparkContext#runJob.
> 842 mappedRDD = rdd.mapPartitions(partitionFunc)
> --> 843 it = self._jvm.PythonRDD.runJob(self._jsc.sc(),
> mappedRDD._jrdd, javaPartitions, allowLocal)
> 844 return list(mappedRDD._collect_iterator_through_file(it))
> 845
>
> /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
> in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539
> 540 for temp_arg in temp_args:
>
> /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
> in get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 10.0 (TID 47, dhd490101.autotrader.com):
> org.apache.spark.api.python.PythonException: Traceback (most recent call
> last):
>   File
> "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/worker.py",
> line 101, in main
> process()
>   File
> "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/worker.py",
> line 96, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File
> "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/serializers.py",
> line 236, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py", line
> 1220, in takeUpToNumLeft
> while taken < left:
> ImportError: No module named iter
>
> at 
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
> at
> org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:176)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
> 

Re: collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread Ruslan Dautkhanov
There was a similar problem reported on this list before.

Weird Python errors like this generally mean you have different
versions of Python on the nodes of your cluster. Can you check that?

From the error stack, you are using 2.7.10 |Anaconda 2.3.0,
while the OS/CDH version of Python is probably 2.6.
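
One quick way to check is to have each executor report which Python it is
actually running, e.g. (a sketch; the partition count is arbitrary, just
enough to touch every node):

    import socket
    import sys

    # (hostname, python executable, python version) as seen inside each executor
    survey = (sc.parallelize(range(100), 20)
                .map(lambda _: (socket.gethostname(), sys.executable,
                                sys.version.split()[0]))
                .distinct()
                .collect())

    for host, exe, ver in sorted(survey):
        print("%s  %s  %s" % (host, exe, ver))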



-- 
Ruslan Dautkhanov

On Mon, Aug 10, 2015 at 3:53 PM, YaoPau  wrote:

> I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via
> iPython Notebook.  I'm getting collect() to work just fine, but take()
> errors.  My code is below.  Any thoughts?
> [...]

Re: collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread Jon Gregg
We did have 2.7 on the driver and 2.6 on the edge nodes, and figured that was
the issue, so since then we've tried many combinations of 2.6.6, 2.7.5, and
Anaconda's 2.7.10 on each node, with different PATHs and PYTHONPATHs each
time.  Every combination has produced the same error.

We came across a comment on the User board saying "Since you're using YARN,
you may also need to set SPARK_YARN_USER_ENV to
"PYSPARK_PYTHON=/your/desired/python/on/slave/nodes"." ... I couldn't find
SPARK_YARN_USER_ENV in the Spark 1.3 docs but we tried that as well and
couldn't get it working.

We're open to trying or re-trying any other ideas.

On Mon, Aug 10, 2015 at 6:25 PM, Ruslan Dautkhanov 
wrote:

> There was a similar problem reported on this list before.
>
> Weird Python errors like this generally mean you have different
> versions of Python on the nodes of your cluster. Can you check that?
>
> From the error stack, you are using 2.7.10 |Anaconda 2.3.0,
> while the OS/CDH version of Python is probably 2.6.
> [...]

Re: collect() works, take() returns "ImportError: No module named iter"

2015-08-12 Thread YaoPau
In case anyone runs into this issue in the future, we got it working: the
following variable must be set on the edge node:

export PYSPARK_PYTHON=/your/path/to/whatever/python/you/want/to/run/bin/python

I didn't realize that variable gets passed to every worker node.  All I saw
when searching for this issue was documentation for an older version of
Spark that mentioned using SPARK_YARN_USER_ENV to set PYSPARK_PYTHON within
spark-env.sh; that approach didn't work for us on Spark 1.3.
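
A side note in case it saves someone a step: since the driver appears to read
PYSPARK_PYTHON from its environment when the SparkContext is created and pass
that executable to the workers, it should also be possible to set it from
Python before building the context. We haven't verified this, so treat it as
a sketch only (the path is a placeholder and must exist on every node):

    import os

    # Must be set before the SparkContext is constructed; placeholder path
    os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

    from pyspark import SparkConf, SparkContext
    sc = SparkContext(conf=SparkConf().setAppName("pyspark-python-test"))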



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/collect-works-take-returns-ImportError-No-module-named-iter-tp24199p24234.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org