Hi Spark users,

I've been hitting a consistent error that I have been trying to reproduce 
and narrow down. I'm running a PySpark application on Spark 1.2 that reads 
avro files from Hadoop, and I was consistently seeing the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4

After some searching, it appears this most likely means my Hadoop versions 
are mismatched: server IPC version 7 is what a CDH4 (Hadoop 2.0.0) NameNode 
speaks, while client version 4 is the Hadoop 1.x RPC client, so something on 
my classpath must be pulling in Hadoop 1 client classes. I had the following 
versions at the time (a quick way to confirm what the driver actually loads 
is sketched after the list):

  *   Hadoop: hadoop-2.0.0-cdh4.7.0
  *   Spark: spark-1.2.0-bin-cdh4.2.0
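
To confirm which Hadoop client the Spark JVM actually loads, a py4j probe 
along these lines should work (just a sketch; sc._jvm is an internal PySpark 
handle, so this is a debugging hack rather than a public API):

from pyspark import SparkContext

sc = SparkContext(appName="hadoop-version-check")
# VersionInfo reports the version of the Hadoop client classes the JVM
# actually loaded, regardless of what the jar names on disk claim.
print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
sc.stop()

If that prints something like 1.0.4 instead of 2.0.0-cdh4.7.0, the Hadoop 1 
client is coming from the Spark build itself.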

I've never had a problem with this setup in the past on Spark 1.1.1 or Spark 
1.0.2. I figured it was worth rebuilding Spark in case I was wrong about the 
versions, so I ran this command on the v1.2.0 tag:

./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0
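
After the rebuild, one thing worth checking is whether the resulting 
assembly really bundles only CDH4 client classes. A rough sketch (the 
assembly jar path is an assumption based on the usual dist/ layout, and it 
relies on the shaded jar keeping its Maven metadata):

import zipfile

# Hadoop 1.x ships its client as hadoop-core; Hadoop 2.x / CDH4 ships
# hadoop-common and hadoop-hdfs. Whichever pom.properties shows up in the
# assembly tells you which client version actually got bundled.
jar = "dist/lib/spark-assembly-1.2.0-hadoop2.0.0-cdh4.7.0.jar"
with zipfile.ZipFile(jar) as z:
    for name in z.namelist():
        if name.startswith("META-INF/maven/org.apache.hadoop/") and name.endswith("pom.properties"):
            print(name)
            print(z.read(name).decode())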

I then retried my previously mentioned application with this new build of 
Spark. Same error.

To narrow the problem down further, I tried the avro-loading example that 
ships with Spark. I ran the command below (I know SPARK_CLASSPATH is a 
deprecated way of passing jars to the driver classpath):

SPARK_CLASSPATH="/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH" \
  bin/spark-submit ./examples/src/main/python/avro_inputformat.py \
  "hdfs://localhost:8020/path/to/file.avro"

I ended up with the same error. The full stacktrace is below.

Traceback (most recent call last):
  File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
    conf=conf)
  File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
    jconf, batchSize)
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
    at org.apache.hadoop.ipc.Client.call(Client.java:1113)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
    at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
    at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
    at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:514)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopFile(PythonRDD.scala:469)
    at org.apache.spark.api.python.PythonRDD.newAPIHadoopFile(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:724)

I could see my avro-mapred jar possibly being the problem. However, it is 
also built for hadoop 2 and has never caused trouble in the past, so I don't 
think it's the likely culprit.
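
One sanity check that would rule the avro jar in or out is a plain text read 
of the same HDFS path, which involves no avro classes at all (a minimal 
sketch):

from pyspark import SparkContext

sc = SparkContext(appName="ipc-check")
# No avro or custom input formats involved; if this raises the same
# RemoteException, the Hadoop 1 client is coming from the Spark build
# itself rather than from avro-mapred.
print(sc.textFile("hdfs://localhost:8020/path/to/file.avro").count())
sc.stop()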

Any suggestions for debugging, or more direct pointers to what is probably 
wrong, would be much appreciated.

Michael
