[ https://issues.apache.org/jira/browse/LIVY-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Bronte updated LIVY-504:
-----------------------------
    Summary: Livy pyspark sqlContext behavior does not match pyspark shell  (was: Pyspark sqlContext behavior does not match pyspark shell)

Livy pyspark sqlContext behavior does not match pyspark shell
-------------------------------------------------------------

                 Key: LIVY-504
                 URL: https://issues.apache.org/jira/browse/LIVY-504
             Project: Livy
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.5.0
         Environment: AWS EMR 5.16.0
            Reporter: Adam Bronte
            Priority: Major

On 0.5.0 I'm seeing behavior through Livy, around the SparkContext and sqlContext, that is inconsistent with the pyspark shell.

For example, running this through the pyspark shell works:

{code:java}
[root@ip-10-0-0-32 ~]# pyspark
Python 2.7.14 (default, May 2 2018, 18:31:34)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
18/08/28 18:50:37 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.14 (default, May 2 2018 18:31:34)
SparkSession available as 'spark'.
>>> from pyspark.sql import SQLContext
>>> my_sql_context = SQLContext.getOrCreate(sc)
>>> df = my_sql_context.read.parquet('s3://my-bucket/mydata.parquet')
>>> print(df.count())
67556724
{code}

But through Livy, the same code throws an exception:

{code:java}
from pyspark.sql import SQLContext
my_sql_context = SQLContext.getOrCreate(sc)
df = my_sql_context.read.parquet('s3://my-bucket/mydata.parquet')

'JavaMember' object has no attribute 'read'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in read
    return DataFrameReader(self)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 70, in __init__
    self._jreader = spark._ssql_ctx.read()
AttributeError: 'JavaMember' object has no attribute 'read'
{code}

Also, trying to use the default-initialized sqlContext throws the same error:

{code:java}
df = sqlContext.read.parquet('s3://my-bucket/mydata.parquet')

'JavaMember' object has no attribute 'read'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in read
    return DataFrameReader(self)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 70, in __init__
    self._jreader = spark._ssql_ctx.read()
AttributeError: 'JavaMember' object has no attribute 'read'
{code}

In both the pyspark shell and the Livy session, the objects look the same.

pyspark shell:

{code:java}
>>> print(sc)
<SparkContext master=yarn appName=PySparkShell>
>>> print(sqlContext)
<pyspark.sql.context.SQLContext object at 0x7fd15dfc3450>
>>> print(my_sql_context)
<pyspark.sql.context.SQLContext object at 0x7fd15dfc3450>
{code}

Livy:

{code:java}
print(sc)
<SparkContext master=yarn appName=livy-session-1>
print(sqlContext)
<pyspark.sql.context.SQLContext object at 0x7f478c06b850>
print(my_sql_context)
<pyspark.sql.context.SQLContext object at 0x7f478c06b850>
{code}
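The traceback narrows this down. pyspark's DataFrameReader.__init__ evaluates spark._ssql_ctx.read(), and _ssql_ctx is normally a py4j JavaObject wrapping the JVM-side SQLContext; an AttributeError naming 'JavaMember' means that under Livy _ssql_ctx is instead a py4j JavaMember, i.e. a method reference that was never invoked. That would also explain why SQLContext.getOrCreate(sc) fails identically: it returns the already-instantiated singleton (note sqlContext and my_sql_context print the same address above), so it inherits the same broken wrapper. A quick check, as a sketch against the session's pre-bound sqlContext:

{code:python}
# Sketch: inspect what the SQLContext wrapper is actually holding.
# Healthy pyspark shell:      <class 'py4j.java_gateway.JavaObject'>
# Suspected under Livy 0.5.0: <class 'py4j.java_gateway.JavaMember'>
print(type(sqlContext._ssql_ctx))
{code}

If this prints JavaMember, whatever Livy binds to sqlContext wraps the JVM accessor method itself rather than the object that method returns.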
I'm running this through sparkmagic, but I have also confirmed the same behavior when calling the API directly:

{code:java}
curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool
{
    "appId": null,
    "appInfo": {
        "driverLogUrl": null,
        "sparkUiUrl": null
    },
    "id": 3,
    "kind": "pyspark",
    "log": [
        "stdout: ",
        "\nstderr: ",
        "\nYARN Diagnostics: "
    ],
    "owner": null,
    "proxyUser": null,
    "state": "starting"
}
{code}

{code:java}
curl --silent localhost:8998/sessions/3/statements -X POST -H 'Content-Type: application/json' -d '{"code":"df = sqlContext.read.parquet(\"s3://my-bucket/mydata.parquet\")"}' | python -m json.tool
{
    "code": "df = sqlContext.read.parquet(\"s3://my-bucket/mydata.parquet\")",
    "id": 1,
    "output": null,
    "progress": 0.0,
    "state": "running"
}
{code}

When running on 0.4.0, both the pyspark shell and Livy versions worked.
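Until the regression is fixed, a possible workaround, sketched under the assumption that the session's SparkSession is healthy (the pyspark banner above binds it as 'spark'): read through the SparkSession, whose read property builds its DataFrameReader from the session's own internal wrapper and never touches the broken sqlContext binding.

{code:python}
from pyspark.sql import SparkSession

# Reuse the session attached to the existing SparkContext; if 'spark'
# is already bound in the Livy session this is effectively a no-op.
spark = SparkSession.builder.getOrCreate()

# Workaround sketch: bypass the broken SQLContext wrapper entirely.
df = spark.read.parquet('s3://my-bucket/mydata.parquet')
print(df.count())
{code}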