[ 
https://issues.apache.org/jira/browse/SPARK-26777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766613#comment-16766613
 ] 

Yuri Budilov commented on SPARK-26777:
--------------------------------------

I split the above query from 1 into 3 and it worked OK on 2.4.0 EMR Spark, so I 
was able to fix my production job on latest EMR Spark 2.4, which is what I am 
paid to do. This SQL statement works "as is" on Presto/Athena, so it is valid 
SQL code on Athena and on EMR Spark 2.3.x. 

Since I posted this, I also tried using *explicit* query table *aliases* 
everywhere in SQL (i.e. *_select_* *_from table_name as alias z_*, in all 
*sub-select* statements), it still *failed even with explicit aliases.* 

Regarding your request: sorry, I do not and will not use Scala and I will not 
build a standalone Spark cluster on-premises either - because both of these 
tests are totally irrelevant to me, even if they do work OK in 2.4, it of no 
value to me, the end user of Spark.

Fact is - it does not work on PySpark Spark 2.4 EMR but it used to work OK on 
Pyspark EMR Spark 2.3.x.

This is all I care about. It looks like a regression bug to me. If you folks 
don't want to fix it, this is your choice, I do understand this is the world of 
free-software so I do not expect anything.

If the problem is not a bug but a feature-by-design and if I am doing something 
wrong with my SQL then some documentation would be greatly appreciated so I and 
future readers can learn what not to do on Spark SQL. Perhaps delete this Jira 
Issue so it does not mislead anyone else.

 

 

 

 

 

 

> SQL worked in 2.3.2 and fails in 2.4.0
> --------------------------------------
>
>                 Key: SPARK-26777
>                 URL: https://issues.apache.org/jira/browse/SPARK-26777
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Yuri Budilov
>            Priority: Major
>
> Following SQL worked in Spark 2.3.2 and now fails on 2.4.0 (AWS EMR Spark)
> ------------ PySpark call below:
> spark.sql("select partition_year_utc,partition_month_utc,partition_day_utc \
> from datalake_reporting.copy_of_leads_notification \
> where partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification) \
> and partition_month_utc = \
>  (select max(partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m \
>  where \
>  m.partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification)) \
>  and partition_day_utc = (select max(d.partition_day_utc) from 
> datalake_reporting.copy_of_leads_notification as d \
>  where d.partition_month_utc = \
>  (select max(m1.partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m1 \
>  where m1.partition_year_utc = \
>  (select max(y.partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification as y) \
>  ) \
>  ) \
>  order by 1 desc, 2 desc, 3 desc limit 1 ").show(1,False)
> Error: (no need for data, this is syntax).
> py4j.protocol.Py4JJavaError: An error occurred while calling o1326.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#4495 []
>  
> Note: all 3 columns in query are Partitioned columns - see bottom of the 
> schema)
>  
> Hive EMR AWS Schema is:
>  
> CREATE EXTERNAL TABLE `copy_of_leads_notification`(
> `message.environment.siteorigin` string, `dcpheader.dcploaddateutc` string, 
> `message.id` int, `source.properties._country` string, `message.created` 
> string, `dcpheader.generatedmessageid` string, `message.tags` bigint, 
> `source.properties._enqueuedtimeutc` string, `source.properties._leadtype` 
> string, `message.itemid` string, `message.prospect.postcode` string, 
> `message.prospect.email` string, `message.referenceid` string, 
> `message.item.year` string, `message.identifier` string, 
> `dcpheader.dcploadmonthutc` string, `message.processed` string, 
> `source.properties._tenant` string, `message.item.price` string, 
> `message.subscription.confirmresponse` boolean, `message.itemtype` string, 
> `message.prospect.lastname` string, `message.subscription.insurancequote` 
> boolean, `source.exchangename` string, 
> `message.prospect.identificationnumbers` bigint, 
> `message.environment.ipaddress` string, `dcpheader.dcploaddayutc` string, 
> `source.properties._itemtype` string, `source.properties._requesttype` 
> string, `message.item.make` string, `message.prospect.firstname` string, 
> `message.subscription.survey` boolean, `message.prospect.homephone` string, 
> `message.extendedproperties` bigint, `message.subscription.financequote` 
> boolean, `message.uniqueidentifier` string, `source.properties._id` string, 
> `dcpheader.sourcemessageguid` string, `message.requesttype` string, 
> `source.routingkey` string, `message.service` string, `message.item.model` 
> string, `message.environment.pagesource` string, `source.source` string, 
> `message.sellerid` string, `partition_date_utc` string, 
> `message.selleridentifier` string, `message.subscription.newsletter` boolean, 
> `dcpheader.dcploadyearutc` string, `message.leadtype` string, 
> `message.history` bigint, `message.callconnect.calloutcome` string, 
> `message.callconnect.datecreatedutc` string, 
> `message.callconnect.callrecordingurl` string, 
> `message.callconnect.transferoutcome` string, 
> `message.callconnect.hiderecording` boolean, 
> `message.callconnect.callstartutc` string, `message.callconnect.code` string, 
> `message.callconnect.callduration` string, `message.fraudnetinfo` string, 
> `message.callconnect.answernumber` string, `message.environment.sourcedevice` 
> string, `message.comments` string, `message.fraudinfo.servervariables` 
> bigint, `message.callconnect.servicenumber` string, 
> `message.callconnect.callid` string, `message.callconnect.voicemailurl` 
> string, `message.item.stocknumber` string, 
> `message.callconnect.answerduration` string, `message.callconnect.callendutc` 
> string, `message.item.series` string, `message.item.detailsurl` string, 
> `message.item.pricetype` string, `message.item.description` string, 
> `message.item.colour` string, `message.item.badge` string, 
> `message.item.odometer` string, `message.environment.requestheader` string, 
> `message.item.registrationnumber` string, `message.item.bodytype` string, 
> `message.item.fueltype` string, `message.item.redbookcode` string, 
> `message.item.spotid` string, `message.item.id` string, 
> `message.item.transmission` string, `message.item.vin` string, 
> `message.item.enginedescription` string, `message.prospect.mobilephone` 
> string, `message.prospect.membertrackingid` string, 
> `message.environment.username` string, `message.prospect.workphone` string, 
> `message.environment.servername` string, `message.environment.sessionid` 
> string, `message.tradein.type` string, `message.tradein.model` string, 
> `message.tradein.year` string, `message.tradein.make` string, 
> `message.tradein.kms` string, `message.fraudinfo.servertimestamp` string, 
> `message.prospect.suburb` string, `message.callconnect.username` string, 
> `message.callconnect.password` string, `message.status` string, 
> `message.tradein.colour` string, `message.prospect.address` string, 
> `message.prospect.state` string, `message.tradein.detailsurl` string, 
> `message.prospect.faxnumber` string, `message.prospect.companyname` string, 
> `message.prospect.title` string, `message.callconnect.calloutcometext` 
> string, `message.prospect.preferredcontacttime` string, 
> `message.fraudinfo.devicedata` string, 
> `message.prospect.preferredcontactmethod` string, 
> `message.assignment.assigned` string, `message.assignment.email` string, 
> `message.assignment.name` string, `message.country` string, 
> `message.financepackage.id` string, `message.financepackage.version` string)
> PARTITIONED BY ( `partition_year_utc` string, `partition_month_utc` string, 
> `partition_day_utc` string, `job_run_guid` string)
> ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3://datalake/yurib_test/leads_notification'
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to