[ https://issues.apache.org/jira/browse/SPARK-33559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gaetan updated SPARK-33559:
---------------------------
    Description: 
{code:python}
# Assumes ss is an active SparkSession.
from pyspark.sql import functions as sf

df = ss.read.parquet("/path/to/parquet/dataset")
df.select("partnerid").withColumn("index", sf.monotonically_increasing_id()).explain(True)
{code}

{{We expect only the partnerid column to be read from the Parquet dataset, but the scan actually reads every column:}}


{code:java}
...
== Physical Plan ==
Project [partnerid#6794, monotonically_increasing_id() AS index#24939L]
+- FileScan parquet [impression_id#6550,arbitrage_id#6551,display_timestamp#6552L,requesttimestamputc#6553,affiliateid#6554,amp_adrequest_type#6555,app_id#6556,app_name#6557,appnexus_viewability#6558,apxpagevertical#6559,arbitrage_time#6560,banner_type#6561,display_type_int#6562,has_multiple_display_types#6563,bannerid#6564,bid_app_id_hash#6565,bid_url_domain_hash#6566,bidding_details#6567,bid_level_core#6568,bidrandomization_user_factor#6569,bidrandomization_user_mu#6570,bidrandomization_user_sigma#6571,big_lastrequesttimestampsession#6572,big_nbrequestaffiliatesession#6573,... 566 more fields] ...
{code}
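
To check a plan like the one above without reading it by eye, something along these lines might help. It is only a rough sketch: the helper names scanned_columns and is_pruned are made up for illustration, SAMPLE_PLAN is a shortened excerpt of the plan from this report, and the regular expression assumes the "FileScan <format> [col#id,...]" layout shown here, which may differ in other Spark versions or data sources.

```python
import re
from typing import List

# Shortened excerpt of the physical plan from this report (illustrative only).
SAMPLE_PLAN = (
    "== Physical Plan == Project [partnerid#6794, "
    "monotonically_increasing_id() AS index#24939L] +- FileScan parquet "
    "[impression_id#6550,arbitrage_id#6551,display_timestamp#6552L,"
    "... 566 more fields] ..."
)


def scanned_columns(plan: str) -> List[str]:
    """Return the column names listed in the first FileScan node of a plan string."""
    match = re.search(r"FileScan\s+\w+\s*\[(.*?)\]", plan, flags=re.DOTALL)
    if match is None:
        return []
    columns = []
    for field in match.group(1).split(","):
        field = field.strip()
        # Skip empty pieces and the "... N more fields" truncation marker.
        if not field or field.startswith("..."):
            continue
        # Drop the expression ID suffix, e.g. "#6550" or "#6552L".
        columns.append(re.sub(r"#\d+L?$", "", field))
    return columns


def is_pruned(plan: str, expected: List[str]) -> bool:
    """True if every column listed in the scan is one of the expected columns."""
    return set(scanned_columns(plan)) <= set(expected)


print(scanned_columns(SAMPLE_PLAN))
# ['impression_id', 'arbitrage_id', 'display_timestamp']
print(is_pruned(SAMPLE_PLAN, ["partnerid"]))
# False
```

Because explain() truncates long column lists, the helper only sees the columns that are actually printed. That is enough to detect an unpruned scan as in this report, but not to prove that a truncated scan is fully pruned.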



> Column pruning with monotonically_increasing_id
> -----------------------------------------------
>
>                 Key: SPARK-33559
>                 URL: https://issues.apache.org/jira/browse/SPARK-33559
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.1
>            Reporter: Gaetan
>            Priority: Minor
>



