RE: PySpark 1.2 Hadoop version mismatch
I looked at the environment in which I ran the spark-submit command, and it looks like there is nothing that could be messing with the classpath. Just to be sure, I checked the web UI, which says the classpath contains:

- The two jars I added: /path/to/avro-mapred-1.7.4-hadoop2.jar and lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar
- The Spark assembly jar in the same location: /path/to/spark/lib/spark-assembly-1.2.0-hadoop2.0.0-cdh4.7.0.jar
- The conf folder: /path/to/spark/conf
- The Python script I was running

From: Sean Owen [so...@cloudera.com]
Sent: Thursday, February 12, 2015 12:13 AM
To: Akhil Das
Cc: Michael Nazario; user@spark.apache.org
Subject: Re: PySpark 1.2 Hadoop version mismatch

No, "mr1" should not be the issue here, and I think that would break other things. The OP is not using mr1. Client 4 / server 7 means roughly "client is Hadoop 1.x, server is Hadoop 2.0.x".

Normally, I'd say you are packaging Hadoop code in your app by bringing in Spark and its deps. Your app shouldn't have any of this code. If you're running the straight examples, though, I'm less sure. If anything, your client is later than your server. I wonder if you have anything else set on your local classpath via env variables?

On Thu, Feb 12, 2015 at 6:52 AM, Akhil Das wrote:
> Did you have a look at http://spark.apache.org/docs/1.2.0/building-spark.html ?
>
> I think you can simply download the source and build for your Hadoop version as:
>
> mvn -Dhadoop.version=2.0.0-mr1-cdh4.7.0 -DskipTests clean package
>
> Thanks
> Best Regards
>
> On Thu, Feb 12, 2015 at 11:45 AM, Michael Nazario wrote:
>>
>> I also forgot some other information. I have made this error go away by making my PySpark application use spark-1.1.1-bin-cdh4 for the driver, but communicate with a Spark 1.2 master and worker. It's not a good workaround, so I would like the driver to also be Spark 1.2.
>>
>> Michael
>>
>> From: Michael Nazario
>> Sent: Wednesday, February 11, 2015 10:13 PM
>> To: user@spark.apache.org
>> Subject: PySpark 1.2 Hadoop version mismatch
>>
>> Hi Spark users,
>>
>> I seem to be having this consistent error, which I have been trying to reproduce and narrow down. I've been running a PySpark application on Spark 1.2 that reads Avro files from Hadoop, and I was consistently seeing the following error:
>>
>> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
>> : org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
>>
>> After some searching, I noticed that this most likely meant my Hadoop versions were mismatched. I had the following versions at the time:
>>
>> Hadoop: hadoop-2.0.0-cdh4.7.0
>> Spark: spark-1.2.0-bin-cdh4.2.0
>>
>> In the past, I've never had a problem with this setup on Spark 1.1.1 or Spark 1.0.2. I figured it was worth rebuilding Spark in case I was wrong about versions. To rebuild my Spark, I ran this command on the v1.2.0 tag:
>>
>> ./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0
>>
>> I then retried my previously mentioned application with this new build of Spark. Same error.
>>
>> To narrow down the problem some more, I figured I should try the example that comes with Spark for loading an Avro file. I ran the command below (I know it uses a deprecated way of passing jars to the driver classpath):
>>
>> SPARK_CLASSPATH="/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH" bin/spark-submit ./examples/src/main/python/avro_inputformat.py "hdfs://localhost:8020/path/to/file.avro"
>>
>> I ended up with the same error. The full stacktrace is below.
>>
>> Traceback (most recent call last):
>>   File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
>>     conf=conf)
>>   File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
>>     jconf, batchSize)
>>   File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
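A quick way to cross-check the classpath report above is to ask the driver JVM directly from inside PySpark, through the py4j gateway, rather than trusting the web UI. This is a minimal sketch, assuming a 1.2-era PySpark where the internal sc._jvm handle is available; the env var names checked are just the usual suspects, not ones confirmed in this thread:

    import os
    from pyspark import SparkContext

    sc = SparkContext(appName="classpath-check")

    # Shell variables that can silently prepend old Hadoop jars
    for var in ("SPARK_CLASSPATH", "HADOOP_CLASSPATH", "CLASSPATH"):
        print("%s=%s" % (var, os.environ.get(var)))

    # The classpath the driver JVM actually ended up with, one entry per line
    jvm_cp = sc._jvm.java.lang.System.getProperty("java.class.path")
    for entry in jvm_cp.split(":"):
        print(entry)

If an unexpected Hadoop 1.x jar shows up anywhere in that list, it would explain the client-version-4 behavior regardless of what the Spark build was compiled against.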
Re: PySpark 1.2 Hadoop version mismatch
No, "mr1" should not be the issue here, and I think that would break other things. The OP is not using mr1. Client 4 / server 7 means roughly "client is Hadoop 1.x, server is Hadoop 2.0.x".

Normally, I'd say you are packaging Hadoop code in your app by bringing in Spark and its deps. Your app shouldn't have any of this code. If you're running the straight examples, though, I'm less sure. If anything, your client is later than your server. I wonder if you have anything else set on your local classpath via env variables?

On Thu, Feb 12, 2015 at 6:52 AM, Akhil Das wrote:
> Did you have a look at http://spark.apache.org/docs/1.2.0/building-spark.html ?
>
> I think you can simply download the source and build for your Hadoop version as:
>
> mvn -Dhadoop.version=2.0.0-mr1-cdh4.7.0 -DskipTests clean package
>
> Thanks
> Best Regards
>
> On Thu, Feb 12, 2015 at 11:45 AM, Michael Nazario wrote:
>>
>> I also forgot some other information. I have made this error go away by making my PySpark application use spark-1.1.1-bin-cdh4 for the driver, but communicate with a Spark 1.2 master and worker. It's not a good workaround, so I would like the driver to also be Spark 1.2.
>>
>> Michael
>>
>> From: Michael Nazario
>> Sent: Wednesday, February 11, 2015 10:13 PM
>> To: user@spark.apache.org
>> Subject: PySpark 1.2 Hadoop version mismatch
>>
>> Hi Spark users,
>>
>> I seem to be having this consistent error, which I have been trying to reproduce and narrow down. I've been running a PySpark application on Spark 1.2 that reads Avro files from Hadoop, and I was consistently seeing the following error:
>>
>> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
>> : org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
>>
>> After some searching, I noticed that this most likely meant my Hadoop versions were mismatched. I had the following versions at the time:
>>
>> Hadoop: hadoop-2.0.0-cdh4.7.0
>> Spark: spark-1.2.0-bin-cdh4.2.0
>>
>> In the past, I've never had a problem with this setup on Spark 1.1.1 or Spark 1.0.2. I figured it was worth rebuilding Spark in case I was wrong about versions. To rebuild my Spark, I ran this command on the v1.2.0 tag:
>>
>> ./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0
>>
>> I then retried my previously mentioned application with this new build of Spark. Same error.
>>
>> To narrow down the problem some more, I figured I should try the example that comes with Spark for loading an Avro file. I ran the command below (I know it uses a deprecated way of passing jars to the driver classpath):
>>
>> SPARK_CLASSPATH="/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH" bin/spark-submit ./examples/src/main/python/avro_inputformat.py "hdfs://localhost:8020/path/to/file.avro"
>>
>> I ended up with the same error. The full stacktrace is below.
>>
>> Traceback (most recent call last):
>>   File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
>>     conf=conf)
>>   File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
>>     jconf, batchSize)
>>   File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>>   File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
>> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
>> : org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
>>     at org.apache.hadoop.ipc.Client.call(Client.java:1113)
>>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
>>     at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
>>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
>>     at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
>>     at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
>>     at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
>>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
>>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
>>     at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
>>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
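Sean's reading of the error ("client is Hadoop 1.x") can be tested directly: Hadoop ships a VersionInfo class, and PySpark's py4j gateway can ask the driver JVM which Hadoop it actually loaded. A minimal sketch, again assuming the internal sc._jvm handle of 1.2-era PySpark:

    from pyspark import SparkContext

    sc = SparkContext(appName="hadoop-version-check")

    # Reports the version of whichever Hadoop jar won on the driver classpath.
    # Seeing a 1.x version here against a CDH4 (2.0.0) cluster would explain
    # "Server IPC version 7 cannot communicate with client version 4".
    print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())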
Re: PySpark 1.2 Hadoop version mismatch
Did you have a look at http://spark.apache.org/docs/1.2.0/building-spark.html ?

I think you can simply download the source and build for your Hadoop version as:

mvn -Dhadoop.version=2.0.0-mr1-cdh4.7.0 -DskipTests clean package

Thanks
Best Regards

On Thu, Feb 12, 2015 at 11:45 AM, Michael Nazario wrote:
> I also forgot some other information. I have made this error go away by making my PySpark application use spark-1.1.1-bin-cdh4 for the driver, but communicate with a Spark 1.2 master and worker. It's not a good workaround, so I would like the driver to also be Spark 1.2.
>
> Michael
>
> From: Michael Nazario
> Sent: Wednesday, February 11, 2015 10:13 PM
> To: user@spark.apache.org
> Subject: PySpark 1.2 Hadoop version mismatch
>
> Hi Spark users,
>
> I seem to be having this consistent error, which I have been trying to reproduce and narrow down. I've been running a PySpark application on Spark 1.2 that reads Avro files from Hadoop, and I was consistently seeing the following error:
>
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
> : org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
>
> After some searching, I noticed that this most likely meant my Hadoop versions were mismatched. I had the following versions at the time:
>
> - Hadoop: hadoop-2.0.0-cdh4.7.0
> - Spark: spark-1.2.0-bin-cdh4.2.0
>
> In the past, I've never had a problem with this setup on Spark 1.1.1 or Spark 1.0.2. I figured it was worth rebuilding Spark in case I was wrong about versions. To rebuild my Spark, I ran this command on the v1.2.0 tag:
>
> ./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0
>
> I then retried my previously mentioned application with this new build of Spark. Same error.
>
> To narrow down the problem some more, I figured I should try the example that comes with Spark for loading an Avro file. I ran the command below (I know it uses a deprecated way of passing jars to the driver classpath):
>
> SPARK_CLASSPATH="/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH" bin/spark-submit ./examples/src/main/python/avro_inputformat.py "hdfs://localhost:8020/path/to/file.avro"
>
> I ended up with the same error. The full stacktrace is below.
>
> Traceback (most recent call last):
>   File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
>     conf=conf)
>   File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
>     jconf, batchSize)
>   File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
> : org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
>     at org.apache.hadoop.ipc.Client.call(Client.java:1113)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
>     at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
>     at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
>     at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
>     at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
>     at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
>     at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:514)
>     at org.apache.spark.api.python.PythonRDD$.newAPIHadoopFile(PythonRDD.scala:469)
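After a rebuild like the one Akhil suggests, it is worth confirming the resulting assembly really bundles Hadoop 2.x classes before rerunning. One heuristic (my assumption, not something from the Spark docs): org.apache.hadoop.ipc.ProtobufRpcEngine exists only in the Hadoop 0.23/2.x line, so its presence or absence in the jar distinguishes a Hadoop 2 build from a Hadoop 1 one. A short sketch with a placeholder jar path:

    import zipfile

    # Placeholder path; point this at whatever assembly the rebuild produced.
    ASSEMBLY = "/path/to/spark/lib/spark-assembly-1.2.0-hadoop2.0.0-cdh4.7.0.jar"

    with zipfile.ZipFile(ASSEMBLY) as jar:
        names = set(jar.namelist())

    # Present in any Hadoop: confirms Hadoop classes are bundled at all.
    print("hadoop ipc client: %s" % ("org/apache/hadoop/ipc/Client.class" in names))
    # Hadoop 2.x-only marker (assumed): False suggests a 1.x build leaked in.
    print("hadoop 2.x marker: %s" % ("org/apache/hadoop/ipc/ProtobufRpcEngine.class" in names))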
RE: PySpark 1.2 Hadoop version mismatch
I also forgot some other information. I have made this error go away by making my PySpark application use spark-1.1.1-bin-cdh4 for the driver, but communicate with a Spark 1.2 master and worker. It's not a good workaround, so I would like the driver to also be Spark 1.2.

Michael

From: Michael Nazario
Sent: Wednesday, February 11, 2015 10:13 PM
To: user@spark.apache.org
Subject: PySpark 1.2 Hadoop version mismatch

Hi Spark users,

I seem to be having this consistent error, which I have been trying to reproduce and narrow down. I've been running a PySpark application on Spark 1.2 that reads Avro files from Hadoop, and I was consistently seeing the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4

After some searching, I noticed that this most likely meant my Hadoop versions were mismatched. I had the following versions at the time:

* Hadoop: hadoop-2.0.0-cdh4.7.0
* Spark: spark-1.2.0-bin-cdh4.2.0

In the past, I've never had a problem with this setup on Spark 1.1.1 or Spark 1.0.2. I figured it was worth rebuilding Spark in case I was wrong about versions. To rebuild my Spark, I ran this command on the v1.2.0 tag:

./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0

I then retried my previously mentioned application with this new build of Spark. Same error.

To narrow down the problem some more, I figured I should try the example that comes with Spark for loading an Avro file. I ran the command below (I know it uses a deprecated way of passing jars to the driver classpath):

SPARK_CLASSPATH="/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH" bin/spark-submit ./examples/src/main/python/avro_inputformat.py "hdfs://localhost:8020/path/to/file.avro"

I ended up with the same error. The full stacktrace is below.

Traceback (most recent call last):
  File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
    conf=conf)
  File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
    jconf, batchSize)
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
    at org.apache.hadoop.ipc.Client.call(Client.java:1113)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
    at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
    at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
    at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:514)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopFile(PythonRDD.scala:469)
    at org.apache.spark.api.python.PythonRDD.newAPIHadoopFile(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvo
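One more note for anyone reproducing this: the stack trace shows the exception fires inside FileInputFormat.addInputPath, when the driver-side DFSClient first handshakes with the NameNode, so the Avro input format is not actually involved. Any HDFS read from the same build should fail identically. A minimal sketch (the HDFS URL is the same placeholder used above):

    from pyspark import SparkContext

    sc = SparkContext(appName="hdfs-handshake-check")

    # first() forces input-split computation on the driver, which is where
    # DFSClient connects to the NameNode; a mismatched Hadoop client fails
    # here with the same "Server IPC version 7 ... client version 4" error.
    print(sc.textFile("hdfs://localhost:8020/path/to/file.avro").first())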