RE: PySpark 1.2 Hadoop version mismatch

2015-02-12 Thread Michael Nazario
I looked at the environment in which I ran the spark-submit command, and it 
looks like there is nothing there that could be interfering with the classpath.

Just to be sure, I checked the web UI, which says the classpath contains:
- The two jars I added: /path/to/avro-mapred-1.7.4-hadoop2.jar and 
lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar
- The spark assembly jar in the same location: 
/path/to/spark/lib/spark-assembly-1.2.0-hadoop2.0.0-cdh4.7.0.jar
- The conf folder: /path/to/spark/conf
- The python script I was running
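
As a cross-check of what the web UI reports, the driver JVM can also be asked 
directly for the classpath it actually loaded. The snippet below is only a 
sketch: it relies on sc._jvm, an internal PySpark attribute rather than a 
public API.

    # Sketch: print the classpath the running driver JVM actually loaded.
    # sc._jvm is an internal PySpark attribute, so treat this purely as a
    # debugging aid.
    from pyspark import SparkContext

    sc = SparkContext(appName="classpath-check")
    jvm_classpath = sc._jvm.java.lang.System.getProperty("java.class.path")
    for entry in jvm_classpath.split(":"):
        print(entry)
    sc.stop()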


From: Sean Owen [so...@cloudera.com]
Sent: Thursday, February 12, 2015 12:13 AM
To: Akhil Das
Cc: Michael Nazario; user@spark.apache.org
Subject: Re: PySpark 1.2 Hadoop version mismatch

No, mr1 should not be the issue here, and I think that would break
other things. The OP is not using mr1.

client 4 / server 7 means roughly client is Hadoop 1.x, server is
Hadoop 2.0.x. Normally, I'd say you are packaging Hadoop code
in your app by bringing in Spark and its deps. Your app shouldn't have
any of this code.

If you're running straight examples though, I'm less sure. If
anything, your client is later than your server. I wonder if you have
anything else set on your local classpath via env variables?
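
One way to test that from PySpark is to print the Hadoop client version the 
driver JVM actually loaded, along with the environment variables that commonly 
pull extra jars onto the classpath. This is a sketch only; it assumes 
org.apache.hadoop.util.VersionInfo is on the driver classpath (it ships with 
the Hadoop client) and uses the internal sc._jvm attribute.

    # Sketch: report the Hadoop client version bundled with the running
    # driver, plus environment variables that commonly add Hadoop jars to
    # the classpath. Uses the internal sc._jvm attribute.
    import os

    from pyspark import SparkContext

    sc = SparkContext(appName="hadoop-client-check")

    # A 1.x/0.20.x version here would line up with "client version 4".
    version = sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
    print("Hadoop client version: %s" % version)

    for var in ("SPARK_CLASSPATH", "HADOOP_CLASSPATH", "HADOOP_HOME", "HADOOP_CONF_DIR"):
        print("%s=%s" % (var, os.environ.get(var)))

    sc.stop()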


On Thu, Feb 12, 2015 at 6:52 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
 Did you have a look at
 http://spark.apache.org/docs/1.2.0/building-spark.html

 I think you can simply download the source and build for your hadoop version
 as:

 mvn -Dhadoop.version=2.0.0-mr1-cdh4.7.0 -DskipTests clean package


 Thanks
 Best Regards

 On Thu, Feb 12, 2015 at 11:45 AM, Michael Nazario mnaza...@palantir.com
 wrote:

 I also forgot some other information. I have made this error go away by
 making my PySpark application use spark-1.1.1-bin-cdh4 for the driver but
 communicate with a Spark 1.2 master and worker. It's not a good workaround,
 so I would like the driver to also be Spark 1.2.

 Michael
 
 From: Michael Nazario
 Sent: Wednesday, February 11, 2015 10:13 PM
 To: user@spark.apache.org
 Subject: PySpark 1.2 Hadoop version mismatch

 Hi Spark users,

 I seem to be hitting a consistent error that I have been trying to reproduce
 and narrow down. I've been running a PySpark application on Spark 1.2 that
 reads Avro files from Hadoop, and I was consistently seeing the following
 error:

 py4j.protocol.Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
 : org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot
 communicate with client version 4

 After some searching, I noticed that this most likely meant my hadoop
 versions were mismatched. I had the following versions at the time:

 Hadoop: hadoop-2.0.0-cdh4.7.0
 Spark: spark-1.2.0-bin-cdh4.2.0

 In the past, I've never had a problem with this setup for Spark 1.1.1 or
 Spark 1.0.2. I figured it was worth rebuilding Spark in case I was wrong
 about the versions. To rebuild Spark, I ran this command on the v1.2.0 tag:

 ./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0

 I then retried my previously mentioned application with this new build of
 Spark. Same error.

 To narrow down the problem some more, I figured I should try out the example
 that ships with Spark for loading an Avro file. I ran the command below (I
 know it uses a deprecated way of passing jars to the driver classpath):


 SPARK_CLASSPATH=/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH
 bin/spark-submit ./examples/src/main/python/avro_inputformat.py
 hdfs://localhost:8020/path/to/file.avro

 I ended up with the same error. The full stacktrace is below.

 Traceback (most recent call last):
   File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
     conf=conf)
   File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
     jconf, batchSize)
   File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
   File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
 : org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot
 communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1113)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
 at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

PySpark 1.2 Hadoop version mismatch

2015-02-11 Thread Michael Nazario
Hi Spark users,

I seem to be hitting a consistent error that I have been trying to reproduce 
and narrow down. I've been running a PySpark application on Spark 1.2 that 
reads Avro files from Hadoop, and I was consistently seeing the following 
error:

py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
communicate with client version 4

After some searching, I noticed that this most likely meant my hadoop versions 
were mismatched. I had the following versions at the time:

  *   Hadoop: hadoop-2.0.0-cdh4.7.0
  *   Spark: spark-1.2.0-bin-cdh4.2.0

In the past, I've never had a problem with this setup for Spark 1.1.1 or Spark 
1.0.2. I figured it was worth rebuilding Spark in case I was wrong about the 
versions. To rebuild Spark, I ran this command on the v1.2.0 tag:

./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0

I then retried my previously mentioned application with this new build of 
Spark. Same error.

To narrow down the problem some more, I figured I should try out the example 
that ships with Spark for loading an Avro file. I ran the command below (I 
know it uses a deprecated way of passing jars to the driver classpath):

SPARK_CLASSPATH=/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH
 bin/spark-submit ./examples/src/main/python/avro_inputformat.py 
hdfs://localhost:8020/path/to/file.avro
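
For context, the call in that example which fails is PySpark's 
newAPIHadoopFile (see the stack trace below). A rough sketch of such a call is 
shown here; the Avro input format, key/value classes and converter name are 
recalled from the Spark 1.2 examples and should be treated as approximate, not 
as a copy of the script.

    # Rough sketch of a PySpark newAPIHadoopFile call for reading Avro,
    # along the lines of what avro_inputformat.py does. The Avro input
    # format, key/value classes and converter name below are assumptions,
    # not copied from the example script.
    from pyspark import SparkContext

    sc = SparkContext(appName="avro-read-sketch")

    # Per the stack trace below, this call is where the driver's Hadoop
    # client contacts the NameNode, and where the IPC version mismatch
    # surfaces.
    avro_rdd = sc.newAPIHadoopFile(
        "hdfs://localhost:8020/path/to/file.avro",
        "org.apache.avro.mapreduce.AvroKeyInputFormat",
        "org.apache.avro.mapred.AvroKey",
        "org.apache.hadoop.io.NullWritable",
        keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter")

    print(avro_rdd.first())
    sc.stop()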

I ended up with the same error. The full stacktrace is below.

Traceback (most recent call last):
  File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
    conf=conf)
  File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
    jconf, batchSize)
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1113)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
at 
org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:514)
at 
org.apache.spark.api.python.PythonRDD$.newAPIHadoopFile(PythonRDD.scala:469)
at org.apache.spark.api.python.PythonRDD.newAPIHadoopFile(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:724)

I could foresee that possibly my avro-mapred jar is a problem. However it is 

RE: PySpark 1.2 Hadoop version mismatch

2015-02-11 Thread Michael Nazario
I also forgot some other information. I have made this error go away by making 
my PySpark application use spark-1.1.1-bin-cdh4 for the driver but communicate 
with a Spark 1.2 master and worker. It's not a good workaround, so I would like 
the driver to also be Spark 1.2.

Michael

From: Michael Nazario
Sent: Wednesday, February 11, 2015 10:13 PM
To: user@spark.apache.org
Subject: PySpark 1.2 Hadoop version mismatch

Hi Spark users,

I seem to be hitting a consistent error that I have been trying to reproduce 
and narrow down. I've been running a PySpark application on Spark 1.2 that 
reads Avro files from Hadoop, and I was consistently seeing the following 
error:

py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
communicate with client version 4

After some searching, I noticed that this most likely meant my hadoop versions 
were mismatched. I had the following versions at the time:

  *   Hadoop: hadoop-2.0.0-cdh4.7.0
  *   Spark: spark-1.2.0-bin-cdh4.2.0

In the past, I've never had a problem with this setup for Spark 1.1.1 or Spark 
1.0.2. I figured it was worth rebuilding Spark in case I was wrong about the 
versions. To rebuild Spark, I ran this command on the v1.2.0 tag:

./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0

I then retried my previously mentioned application with this new build of 
Spark. Same error.

To narrow down the problem some more, I figured I should try out the example 
that ships with Spark for loading an Avro file. I ran the command below (I 
know it uses a deprecated way of passing jars to the driver classpath):

SPARK_CLASSPATH=/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH
 bin/spark-submit ./examples/src/main/python/avro_inputformat.py 
hdfs://localhost:8020/path/to/file.avro

I ended up with the same error. The full stacktrace is below.

Traceback (most recent call last):
  File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
    conf=conf)
  File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
    jconf, batchSize)
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1113)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
at 
org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:514)
at 
org.apache.spark.api.python.PythonRDD$.newAPIHadoopFile(PythonRDD.scala:469)
at org.apache.spark.api.python.PythonRDD.newAPIHadoopFile(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

Job status from Python

2014-12-11 Thread Michael Nazario
In PySpark, is there a way to get the status of a job that is currently 
running? My use case is that I have a long-running job, and users may not know 
whether it is still running. It would be nice to have some idea of whether the 
job is progressing, even if it isn't very granular.

I've looked into the application detail UI, which has per-stage information 
(but unfortunately is not in JSON format), but even at that point I don't 
necessarily know which stages correspond to a job I started.

So I guess my main questions are:

  1.  How do I get the job id of a job started in python?
  2.  If possible, how do I get the stages which correspond to that job?
  3.  Is there any way to get information about currently running stages 
without parsing the Stage UI HTML page?
  4.  Has anyone approached this problem in a different way?
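
One possible approach to questions 1-3 is sketched below. It assumes the 
PySpark StatusTracker API (sc.statusTracker(), setJobGroup and friends), which 
was added to PySpark after this thread and may not exist in the version being 
discussed: tag the work with a job group, then poll the tracker from a 
separate thread for the jobs and stages in that group.

    # Sketch: coarse progress reporting for a long-running PySpark job.
    # Assumes the PySpark StatusTracker API (sc.statusTracker()), which
    # appeared in a later Spark release than the one discussed here.
    import threading
    import time

    from pyspark import SparkContext

    sc = SparkContext(appName="job-progress-sketch")

    def report_progress(group, interval=2, rounds=30):
        """Poll the status tracker and print task progress for one job group."""
        tracker = sc.statusTracker()
        for _ in range(rounds):
            for job_id in tracker.getJobIdsForGroup(group):
                job = tracker.getJobInfo(job_id)
                if job is None:
                    continue
                for stage_id in job.stageIds:
                    stage = tracker.getStageInfo(stage_id)
                    if stage is not None:
                        print("job %d, stage %d (%s): %d/%d tasks complete"
                              % (job_id, stage_id, job.status,
                                 stage.numCompletedTasks, stage.numTasks))
            time.sleep(interval)

    # Tag the long-running work with a job group so it can be looked up later.
    sc.setJobGroup("long-job", "example long-running job")
    reporter = threading.Thread(target=report_progress, args=("long-job",))
    reporter.daemon = True
    reporter.start()

    # The actual long-running computation.
    sc.parallelize(range(100000), 200).map(lambda x: x * x).count()
    sc.stop()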