Re: collect() works, take() returns ImportError: No module named iter

2015-08-13 Thread YaoPau
In case anyone runs into this issue in the future, we got it working: the
following variable must be set on the edge node:

export PYSPARK_PYTHON=/your/path/to/whatever/python/you/want/to/run/bin/python

I didn't realize that variable gets passed to every worker node.  All I saw
when searching for this issue was documentation for an older version of
Spark, which suggested using SPARK_YARN_USER_ENV to set PYSPARK_PYTHON in
spark-env.sh; that didn't work for us on Spark 1.3.
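
If you launch PySpark from a notebook rather than a shell, the same fix can
be applied from Python instead. A minimal sketch, assuming that setting the
variable in the driver process before the SparkContext starts is early
enough for your launch mode (the Anaconda path below is a placeholder, not
our actual path):

import os

# Placeholder path: substitute the interpreter that exists at the same
# location on every worker node.
os.environ['PYSPARK_PYTHON'] = '/opt/anaconda/bin/python'

# Set the variable before the SparkContext (and any RDDs) are created,
# since PySpark reads it when it builds each job.
from pyspark import SparkContext
sc = SparkContext(appName='pyspark-python-fix')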






Re: collect() works, take() returns ImportError: No module named iter

2015-08-10 Thread Ruslan Dautkhanov
There was a similar problem reported on this list before.

Weird Python errors like this generally mean you have different
versions of Python on the nodes of your cluster. Can you check that?

From the error stack you are using 2.7.10 |Anaconda 2.3.0,
while the OS/CDH version of Python is probably 2.6.
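
A quick way to check is to run a tiny job that reports which interpreter
each worker actually uses. A rough sketch (the app name and partition
counts are arbitrary):

import sys
from pyspark import SparkContext

sc = SparkContext(appName='python-version-check')

def worker_python(_):
    import sys  # imported on the worker, not the driver
    return sys.version

print('driver : %s' % sys.version)
print('workers: %s' % sorted(set(
    sc.parallelize(range(100), 20).map(worker_python).collect())))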



-- 
Ruslan Dautkhanov

On Mon, Aug 10, 2015 at 3:53 PM, YaoPau jonrgr...@gmail.com wrote:

 I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via
 IPython Notebook.  I'm getting collect() to work just fine, but take()
 errors.  (I'm having issues with collect() on other datasets ... but take()
 seems to break every time I run it.)

 My code is below.  Any thoughts?

 >>> sc
 <pyspark.context.SparkContext at 0x7ffbfa310f10>
 >>> sys.version
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC
 4.4.7 20120313 (Red Hat 4.4.7-1)]'
 >>> hourly = sc.textFile('tester')
 >>> hourly.collect()
 [u'a man',
  u'a plan',
  u'a canal',
  u'panama']
 >>> hourly = sc.textFile('tester')
 >>> hourly.take(2)
 ---------------------------------------------------------------------------
 Py4JJavaError                             Traceback (most recent call last)
 <ipython-input-15-1feecba5868b> in <module>()
       1 hourly = sc.textFile('tester')
 ----> 2 hourly.take(2)

 /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py in take(self, num)
    1223
    1224         p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
 -> 1225         res = self.context.runJob(self, takeUpToNumLeft, p, True)
    1226
    1227         items += res

 /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
     841         # SparkContext#runJob.
     842         mappedRDD = rdd.mapPartitions(partitionFunc)
 --> 843         it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
     844         return list(mappedRDD._collect_iterator_through_file(it))
     845

 /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
     536         answer = self.gateway_client.send_command(command)
     537         return_value = get_return_value(answer, self.gateway_client,
 --> 538                 self.target_id, self.name)
     539
     540         for temp_arg in temp_args:

 /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
     298                 raise Py4JJavaError(
     299                     'An error occurred while calling {0}{1}{2}.\n'.
 --> 300                     format(target_id, '.', name), value)
     301             else:
     302                 raise Py4JError(

 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.runJob.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage
 10.0 (TID 47, dhd490101.autotrader.com):
 org.apache.spark.api.python.PythonException: Traceback (most recent call
 last):
   File "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/worker.py", line 101, in main
     process()
   File "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/worker.py", line 96, in process
     serializer.dump_stream(func(split_index, iterator), outfile)
   File "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/serializers.py", line 236, in dump_stream
     vs = list(itertools.islice(iterator, batch))
   File "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
     while taken < left:
 ImportError: No module named iter

         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
         at org.apache.spark.scheduler.Task.run(Task.scala:64)
         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
         at java.lang.Thread.run(Thread.java:745)

 Driver stacktrace:
         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
         at
 

Re: collect() works, take() returns ImportError: No module named iter

2015-08-10 Thread Davies Liu
Is it possible that you have Python 2.7 on the driver, but Python 2.6
on the workers?

PySpark requires the same minor version of Python on both the driver
and the workers. PySpark 1.4+ performs this check before running any
tasks.
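
On Spark 1.3, which lacks that guard, you can do an equivalent manual
check up front. A minimal sketch of the idea (this illustrates the
concept; it is not Spark's own implementation):

import sys
from pyspark import SparkContext

sc = SparkContext(appName='version-guard')

def worker_minor(_):
    import sys  # the worker's interpreter, not the driver's
    return '%d.%d' % sys.version_info[:2]

driver = '%d.%d' % sys.version_info[:2]
worker = sc.parallelize([0], 1).map(worker_minor).first()
if driver != worker:
    raise RuntimeError('Python %s on driver but %s on workers'
                       % (driver, worker))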

On Mon, Aug 10, 2015 at 2:53 PM, YaoPau jonrgr...@gmail.com wrote:
 [YaoPau's original message and traceback, quoted in full; trimmed here as a
 verbatim duplicate of the copy above]
 

Re: collect() works, take() returns ImportError: No module named iter

2015-08-10 Thread Jon Gregg
We did have 2.7 on the driver and 2.6 on the edge nodes, and figured that
was the issue, so we've tried many combinations since then with all three of
2.6.6, 2.7.5, and Anaconda's 2.7.10 on each node, with different PATHs and
PYTHONPATHs each time.  Every combination has produced the same error.

We came across a comment on the User board saying "Since you're using YARN,
you may also need to set SPARK_YARN_USER_ENV to
PYSPARK_PYTHON=/your/desired/python/on/slave/nodes." ... I couldn't find
SPARK_YARN_USER_ENV in the Spark 1.3 docs, but we tried that as well and
couldn't get it working.
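
For reference, that suggestion amounts to adding a line like the following
to spark-env.sh on the client node (the path is a placeholder; this is the
older-Spark convention, and it did not take effect for us on 1.3):

export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/opt/anaconda/bin/python"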

We're open to trying or re-trying any other ideas.

On Mon, Aug 10, 2015 at 6:25 PM, Ruslan Dautkhanov dautkha...@gmail.com
wrote:

 There was a similar problem reported on this list before.

 Weird Python errors like this generally mean you have different
 versions of Python on the nodes of your cluster. Can you check that?

 From the error stack you are using 2.7.10 |Anaconda 2.3.0,
 while the OS/CDH version of Python is probably 2.6.



 --
 Ruslan Dautkhanov

 On Mon, Aug 10, 2015 at 3:53 PM, YaoPau jonrgr...@gmail.com wrote:

 [YaoPau's original message and traceback, quoted in full; trimmed here as a
 verbatim duplicate of the copy above]