Hi Spark users,

I've been hitting a consistent error that I have been trying to reproduce and narrow down. I've been running a PySpark application on Spark 1.2 that reads avro files from Hadoop, and I was consistently seeing the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4

After some searching, I gathered that this most likely meant my Hadoop versions were mismatched. I had the following versions at the time:

* Hadoop: hadoop-2.0.0-cdh4.7.0
* Spark: spark-1.2.0-bin-cdh4.2.0

In the past, I never had a problem with this setup on Spark 1.1.1 or Spark 1.0.2. Still, I figured it was worth rebuilding Spark in case I was wrong about the versions. To rebuild Spark, I ran this command on the v1.2.0 tag:

./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0

I then retried the application with this new build of Spark and hit the same error.

To narrow the problem down further, I tried the avro-loading example that ships with Spark. I ran the command below (I know it uses a deprecated way of passing jars to the driver classpath):

SPARK_CLASSPATH="/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH" bin/spark-submit ./examples/src/main/python/avro_inputformat.py "hdfs://localhost:8020/path/to/file.avro"

I ended up with the same error. The full stacktrace is below.

Traceback (most recent call last):
  File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
    conf=conf)
  File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
    jconf, batchSize)
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
	at org.apache.hadoop.ipc.Client.call(Client.java:1113)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
	at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
	at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
	at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
	at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
	at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:514)
	at org.apache.spark.api.python.PythonRDD$.newAPIHadoopFile(PythonRDD.scala:469)
	at org.apache.spark.api.python.PythonRDD.newAPIHadoopFile(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:724)

I could see my avro-mapred jar possibly being the problem, but it is also built for Hadoop 2 and hasn't caused a problem in the past, so I don't think it's likely. Any suggestions for debugging, or more direct pointers at what is probably wrong, would be much appreciated.

Michael
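P.S. As far as I can tell, "client version 4" in this error corresponds to a Hadoop 0.20.x/1.x RPC client while "version 7" is what CDH4 speaks, so my working theory is that something on the classpath is dragging in Hadoop 1 client classes. Below is a rough sketch of a sanity check one could run over SPARK_CLASSPATH. The `classify_hadoop_jars` helper is hypothetical (something I sketched, not part of Spark), and it only looks at the `hadoop1`/`hadoop2` classifier naming convention in jar filenames, so it's a crude first pass rather than a definitive check:

```python
import re

def classify_hadoop_jars(classpath):
    """Crude heuristic: guess the Hadoop generation of each jar on a
    colon-separated classpath from a 'hadoop1'/'hadoop2'-style classifier
    in its filename. Jars without such a classifier come back 'unknown'
    and would need to be inspected by hand (e.g. with `jar tf`)."""
    results = {}
    for entry in classpath.split(":"):
        name = entry.rsplit("/", 1)[-1]       # strip the directory part
        if not name.endswith(".jar"):
            continue                          # ignore non-jar entries
        m = re.search(r"hadoop(\d)", name)    # e.g. '...-hadoop2.jar'
        results[name] = "hadoop" + m.group(1) if m else "unknown"
    return results

# Example with the jars from the spark-submit invocation above:
cp = ("/path/to/avro-mapred-1.7.4-hadoop2.jar:"
      "lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar")
for jar, generation in classify_hadoop_jars(cp).items():
    print(jar, "->", generation)
```

A mix of `hadoop1` and `hadoop2` results (or an `unknown` jar that turns out to bundle Hadoop 1 classes) would be consistent with the IPC version mismatch above.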