Did you have a look at http://spark.apache.org/docs/1.2.0/building-spark.html ?
I think you can simply download the source and build it for your Hadoop version:

mvn -Dhadoop.version=2.0.0-mr1-cdh4.7.0 -DskipTests clean package
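Before rebuilding, it may be worth checking what the driver is actually linking against: the "client version 4" in that error means a Hadoop 1.x RPC client is talking to your CDH4 namenode (which speaks IPC version 7). Here is a minimal sketch to check from PySpark, assuming you don't mind touching the internal sc._jvm py4j handle (the app name is just illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="hadoop-version-check")

# VersionInfo reports the version of the Hadoop client jars on the driver's
# classpath; if this prints something like 1.0.4 instead of 2.0.0-cdh4.7.0,
# the distribution was not actually built against your CDH4 Hadoop.
print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())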
Thanks
Best Regards

On Thu, Feb 12, 2015 at 11:45 AM, Michael Nazario <mnaza...@palantir.com> wrote:

> I also forgot some other information. I have made this error go away by
> making my pyspark application use spark-1.1.1-bin-cdh4 for the driver, but
> communicate with a spark 1.2 master and worker. It's not a good
> workaround, so I would like the driver to also be spark 1.2.
>
> Michael
> ------------------------------
> From: Michael Nazario
> Sent: Wednesday, February 11, 2015 10:13 PM
> To: user@spark.apache.org
> Subject: PySpark 1.2 Hadoop version mismatch
>
> Hi Spark users,
>
> I seem to be having a consistent error which I have been trying to
> reproduce and narrow down. I've been running a PySpark application on
> Spark 1.2 reading avro files from Hadoop, and I was consistently seeing
> the following error:
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
> : org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot
> communicate with client version 4
>
> After some searching, I noticed that this most likely meant my hadoop
> versions were mismatched. I had the following versions at the time:
>
> - Hadoop: hadoop-2.0.0-cdh4.7.0
> - Spark: spark-1.2.0-bin-cdh4.2.0
>
> In the past, I never had a problem with this setup on Spark 1.1.1 or
> Spark 1.0.2. I figured it was worth rebuilding Spark in case I was wrong
> about the versions. To rebuild, I ran this command on the v1.2.0 tag:
>
> ./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0
>
> I then retried the application with this new build of Spark. Same error.
>
> To narrow down the problem some more, I figured I should try the avro
> loading example which comes with Spark. I ran the command below (I know
> it uses a deprecated way of passing jars to the driver classpath):
>
> SPARK_CLASSPATH="/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH" \
> bin/spark-submit ./examples/src/main/python/avro_inputformat.py \
> "hdfs://localhost:8020/path/to/file.avro"
>
> I ended up with the same error. The full stack trace is below.
>
> Traceback (most recent call last):
>   File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
>     conf=conf)
>   File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
>     jconf, batchSize)
>   File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
> : org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
>   at org.apache.hadoop.ipc.Client.call(Client.java:1113)
>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
>   at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
>   at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
>   at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
>   at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
>   at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
>   at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
>   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
>   at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
>   at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:514)
>   at org.apache.spark.api.python.PythonRDD$.newAPIHadoopFile(PythonRDD.scala:469)
>   at org.apache.spark.api.python.PythonRDD.newAPIHadoopFile(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:724)
>
> I could see my avro-mapred jar possibly being the problem. However, it is
> also built for hadoop 2 and hasn't caused a problem in the past, so I
> don't believe that is likely.
>
> Any suggestions for debugging, or more direct help on what is probably
> wrong, would be much appreciated.
>
> Michael
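One more way to narrow it down: the Hadoop IPC client in that stack trace is only engaged for hdfs:// URIs, so reading the same avro file from the local filesystem should tell you whether avro-mapred is involved at all. Here is a rough sketch of the same call avro_inputformat.py makes, with a placeholder file:// path and assuming the examples jar is still on the classpath for the converter:

from pyspark import SparkContext

sc = SparkContext(appName="avro-local-read")  # illustrative app name

avro_rdd = sc.newAPIHadoopFile(
    "file:///path/to/file.avro",  # placeholder: a local copy of the avro file
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter")
print(avro_rdd.first())

If this works but the hdfs:// read still fails with the IPC version error, the mismatch is in the HDFS client jars the driver picked up, not in avro-mapred.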