Hi Spark users,

I've been hitting a consistent error that I have been trying to reproduce and narrow down. I've been running a PySpark application on Spark 1.2 that reads avro files from Hadoop, and I was consistently seeing the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4

After some searching, I gathered that this most likely meant my Hadoop versions were mismatched. I had the following versions at the time:

* Hadoop: hadoop-2.0.0-cdh4.7.0
* Spark: spark-1.2.0-bin-cdh4.2.0

In the past, I never had a problem with this setup on Spark 1.1.1 or Spark 1.0.2. Still, I figured it was worth rebuilding Spark in case I was wrong about the versions. To rebuild Spark, I ran this command on the v1.2.0 tag:

./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0

I then retried the application with this new build of Spark and hit the same error.

To narrow the problem down further, I tried the avro-loading example that ships with Spark. I ran the command below (I know it uses a deprecated way of passing jars to the driver classpath):

SPARK_CLASSPATH="/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH" bin/spark-submit ./examples/src/main/python/avro_inputformat.py "hdfs://localhost:8020/path/to/file.avro"

I ended up with the same error. The full stacktrace is below.

Traceback (most recent call last):
  File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
    conf=conf)
  File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
    jconf, batchSize)
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
	at org.apache.hadoop.ipc.Client.call(Client.java:1113)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
	at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
	at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
	at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
	at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
	at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:514)
	at org.apache.spark.api.python.PythonRDD$.newAPIHadoopFile(PythonRDD.scala:469)
	at org.apache.spark.api.python.PythonRDD.newAPIHadoopFile(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:724)

I could see my avro-mapred jar possibly being the problem, but it is also built for Hadoop 2 and hasn't caused a problem in the past, so I don't think it's likely. Any suggestions for debugging, or more direct pointers at what is probably wrong, would be much appreciated.

Michael
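P.S. As far as I can tell, "client version 4" in this error corresponds to a Hadoop 0.20.x/1.x RPC client while "version 7" is what CDH4 speaks, so my working theory is that something on the classpath is dragging in Hadoop 1 client classes. Below is a rough sketch of a sanity check one could run over SPARK_CLASSPATH. The `classify_hadoop_jars` helper is hypothetical (something I sketched, not part of Spark), and it only looks at the `hadoop1`/`hadoop2` classifier naming convention in jar filenames, so it's a crude first pass rather than a definitive check:

```python
import re

def classify_hadoop_jars(classpath):
    """Crude heuristic: guess the Hadoop generation of each jar on a
    colon-separated classpath from a 'hadoop1'/'hadoop2'-style classifier
    in its filename. Jars without such a classifier come back 'unknown'
    and would need to be inspected by hand (e.g. with `jar tf`)."""
    results = {}
    for entry in classpath.split(":"):
        name = entry.rsplit("/", 1)[-1]       # strip the directory part
        if not name.endswith(".jar"):
            continue                          # ignore non-jar entries
        m = re.search(r"hadoop(\d)", name)    # e.g. '...-hadoop2.jar'
        results[name] = "hadoop" + m.group(1) if m else "unknown"
    return results

# Example with the jars from the spark-submit invocation above:
cp = ("/path/to/avro-mapred-1.7.4-hadoop2.jar:"
      "lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar")
for jar, generation in classify_hadoop_jars(cp).items():
    print(jar, "->", generation)
```

A mix of `hadoop1` and `hadoop2` results (or an `unknown` jar that turns out to bundle Hadoop 1 classes) would be consistent with the IPC version mismatch above.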