Adam Bronte created LIVY-504:
--------------------------------

             Summary: Pyspark sqlContext behavior does not match my spark shell
                 Key: LIVY-504
                 URL: https://issues.apache.org/jira/browse/LIVY-504
             Project: Livy
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.5.0
         Environment: AWS EMR 5.16.0
            Reporter: Adam Bronte
On 0.5.0 I'm seeing inconsistent behavior through Livy regarding the spark context and sqlContext compared to the pyspark shell.

For example, running this through the pyspark shell works:

{code:java}
[root@ip-10-0-0-32 ~]# pyspark
Python 2.7.14 (default, May 2 2018, 18:31:34)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
18/08/28 18:50:37 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.14 (default, May 2 2018 18:31:34)
SparkSession available as 'spark'.
>>> from pyspark.sql import SQLContext
>>> my_sql_context = SQLContext.getOrCreate(sc)
>>> df = my_sql_context.read.parquet('s3://my-bucket/mydata.parquet')
>>> print(df.count())
67556724
{code}

But through Livy, the same code throws an exception:

{code:java}
from pyspark.sql import SQLContext
my_sql_context = SQLContext.getOrCreate(sc)
df = my_sql_context.read.parquet('s3://my-bucket/mydata.parquet')

'JavaMember' object has no attribute 'read'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in read
    return DataFrameReader(self)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 70, in __init__
    self._jreader = spark._ssql_ctx.read()
AttributeError: 'JavaMember' object has no attribute 'read'
{code}

Trying to use the default initialized sqlContext also throws the same error:

{code:java}
df = sqlContext.read.parquet('s3://my-bucket/mydata.parquet')

'JavaMember' object has no attribute 'read'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in read
    return DataFrameReader(self)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 70, in __init__
    self._jreader = spark._ssql_ctx.read()
AttributeError: 'JavaMember' object has no attribute 'read'
{code}

In both the pyspark shell and Livy, the objects look the same.

pyspark shell:

{code:java}
>>> print(sc)
<SparkContext master=yarn appName=PySparkShell>
>>> print(sqlContext)
<pyspark.sql.context.SQLContext object at 0x7fd15dfc3450>
>>> print(my_sql_context)
<pyspark.sql.context.SQLContext object at 0x7fd15dfc3450>
{code}

Livy:

{code:java}
print(sc)
<SparkContext master=yarn appName=livy-session-1>
print(sqlContext)
<pyspark.sql.context.SQLContext object at 0x7f478c06b850>
print(my_sql_context)
<pyspark.sql.context.SQLContext object at 0x7f478c06b850>
{code}

I'm running this through sparkmagic, but I've also confirmed the same behavior when calling the API directly.
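Based on the traceback, it looks like the sqlContext Livy injects holds a py4j JavaMember (an uncalled Java method reference) where a JVM SQLContext object should be, which is why _ssql_ctx.read() fails. A minimal diagnostic sketch, assuming only pyspark's bundled py4j and the spark/sqlContext variables Livy provides in a pyspark session:

{code:java}
# Minimal sketch, run inside a Livy pyspark session.
# If _ssql_ctx is a py4j JavaMember (a Java method reference) rather than a
# JavaObject, DataFrameReader's call to _ssql_ctx.read() raises exactly the
# AttributeError shown above.
from py4j.java_gateway import JavaMember, JavaObject

print(type(sqlContext._ssql_ctx))                    # JavaObject in a healthy shell
print(isinstance(sqlContext._ssql_ctx, JavaMember))  # True would confirm the bug

# Possible workaround (an assumption, not a verified fix): bypass the cached
# SQLContext entirely and read through the SparkSession instead.
df = spark.read.parquet('s3://my-bucket/mydata.parquet')
print(df.count())
{code}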
{code:java}
curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool
{
    "appId": null,
    "appInfo": {
        "driverLogUrl": null,
        "sparkUiUrl": null
    },
    "id": 3,
    "kind": "pyspark",
    "log": [
        "stdout: ",
        "\nstderr: ",
        "\nYARN Diagnostics: "
    ],
    "owner": null,
    "proxyUser": null,
    "state": "starting"
}
{code}

{code:java}
curl --silent localhost:8998/sessions/3/statements -X POST -H 'Content-Type: application/json' -d '{"code":"df = sqlContext.read.parquet(\"s3://my-bucket/mydata.parquet\")"}' | python -m json.tool
{
    "code": "df = sqlContext.read.parquet(\"s3://my-bucket/mydata.parquet\")",
    "id": 1,
    "output": null,
    "progress": 0.0,
    "state": "running"
}
{code}
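The POST above only reports the statement as "running"; a short polling sketch like the one below (assuming the requests package, plus the session id 3 and statement id 1 carried over from the curl calls above) waits for a terminal state so the AttributeError becomes visible in the JSON output:

{code:java}
# Sketch: poll Livy's statement endpoint until the statement finishes,
# then print its output (which carries the traceback on failure).
# Assumes the requests package and the ids from the curl calls above.
import time
import requests

url = "http://localhost:8998/sessions/3/statements/1"
while True:
    stmt = requests.get(url).json()
    if stmt["state"] in ("available", "error", "cancelled"):
        break
    time.sleep(1)

print(stmt["output"])
{code}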