Did you have a look at http://spark.apache.org/docs/1.2.0/building-spark.html ?
I think you can simply download the source and build it for your Hadoop version:

mvn -Dhadoop.version=2.0.0-mr1-cdh4.7.0 -DskipTests clean package
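Before rebuilding, it may be worth checking what the driver is actually linking against: the "client version 4" in that error means a Hadoop 1.x RPC client is talking to your CDH4 namenode (which speaks IPC version 7). Here is a minimal sketch to check from PySpark, assuming you don't mind touching the internal sc._jvm py4j handle (the app name is just illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="hadoop-version-check")

# VersionInfo reports the version of the Hadoop client jars on the driver's
# classpath; if this prints something like 1.0.4 instead of 2.0.0-cdh4.7.0,
# the distribution was not actually built against your CDH4 Hadoop.
print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())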
Thanks
Best Regards

On Thu, Feb 12, 2015 at 11:45 AM, Michael Nazario <mnaza...@palantir.com> wrote:

> I also forgot some other information. I have made this error go away by
> making my pyspark application use spark-1.1.1-bin-cdh4 for the driver, but
> communicate with a spark 1.2 master and worker. It's not a good
> workaround, so I would like the driver to also be spark 1.2.
>
> Michael
> ------------------------------
> From: Michael Nazario
> Sent: Wednesday, February 11, 2015 10:13 PM
> To: user@spark.apache.org
> Subject: PySpark 1.2 Hadoop version mismatch
>
> Hi Spark users,
>
> I seem to be having a consistent error which I have been trying to
> reproduce and narrow down. I've been running a PySpark application on
> Spark 1.2 reading avro files from Hadoop, and I was consistently seeing
> the following error:
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
> : org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot
> communicate with client version 4
>
> After some searching, I noticed that this most likely meant my hadoop
> versions were mismatched. I had the following versions at the time:
>
> - Hadoop: hadoop-2.0.0-cdh4.7.0
> - Spark: spark-1.2.0-bin-cdh4.2.0
>
> In the past, I never had a problem with this setup on Spark 1.1.1 or
> Spark 1.0.2. I figured it was worth rebuilding Spark in case I was wrong
> about the versions. To rebuild, I ran this command on the v1.2.0 tag:
>
> ./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0
>
> I then retried the application with this new build of Spark. Same error.
>
> To narrow down the problem some more, I figured I should try the avro
> loading example which comes with Spark. I ran the command below (I know
> it uses a deprecated way of passing jars to the driver classpath):
>
> SPARK_CLASSPATH="/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH" \
> bin/spark-submit ./examples/src/main/python/avro_inputformat.py \
> "hdfs://localhost:8020/path/to/file.avro"
>
> I ended up with the same error. The full stack trace is below.
>
> Traceback (most recent call last):
>   File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
>     conf=conf)
>   File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
>     jconf, batchSize)
>   File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
> : org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
>   at org.apache.hadoop.ipc.Client.call(Client.java:1113)
>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
>   at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
>   at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
>   at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
>   at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
>   at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
>   at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
>   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
>   at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
>   at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:514)
>   at org.apache.spark.api.python.PythonRDD$.newAPIHadoopFile(PythonRDD.scala:469)
>   at org.apache.spark.api.python.PythonRDD.newAPIHadoopFile(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:724)
>
> I could see my avro-mapred jar possibly being the problem. However, it is
> also built for hadoop 2 and hasn't caused a problem in the past, so I
> don't believe that is likely.
>
> Any suggestions for debugging, or more direct help on what is probably
> wrong, would be much appreciated.
>
> Michael
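One more way to narrow it down: the Hadoop IPC client in that stack trace is only engaged for hdfs:// URIs, so reading the same avro file from the local filesystem should tell you whether avro-mapred is involved at all. Here is a rough sketch of the same call avro_inputformat.py makes, with a placeholder file:// path and assuming the examples jar is still on the classpath for the converter:

from pyspark import SparkContext

sc = SparkContext(appName="avro-local-read")  # illustrative app name

avro_rdd = sc.newAPIHadoopFile(
    "file:///path/to/file.avro",  # placeholder: a local copy of the avro file
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter")
print(avro_rdd.first())

If this works but the hdfs:// read still fails with the IPC version error, the mismatch is in the HDFS client jars the driver picked up, not in avro-mapred.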