[ https://issues.apache.org/jira/browse/SPARK-33559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gaetan updated SPARK-33559:
---------------------------
    Description: 
{code:java}
df = ss.read.parquet("/path/to/parquet/dataset")
df.select("partnerid").withColumn("index", sf.monotonically_increasing_id()).explain(True){code}
{{We would expect only partnerid to be read from the Parquet dataset, but the whole dataset is actually read:}}
{code:java}
...
== Physical Plan ==
Project [partnerid#6794, monotonically_increasing_id() AS index#24939L]
+- FileScan parquet [impression_id#6550,arbitrage_id#6551,display_timestamp#6552L,requesttimestamputc#6553,affiliateid#6554,amp_adrequest_type#6555,app_id#6556,app_name#6557,appnexus_viewability#6558,apxpagevertical#6559,arbitrage_time#6560,banner_type#6561,display_type_int#6562,has_multiple_display_types#6563,bannerid#6564,bid_app_id_hash#6565,bid_url_domain_hash#6566,bidding_details#6567,bid_level_core#6568,bidrandomization_user_factor#6569,bidrandomization_user_mu#6570,bidrandomization_user_sigma#6571,big_lastrequesttimestampsession#6572,big_nbrequestaffiliatesession#6573,... 566 more fields]
...{code}

> Column pruning with monotonically_increasing_id
> -----------------------------------------------
>
>                 Key: SPARK-33559
>                 URL: https://issues.apache.org/jira/browse/SPARK-33559
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.1
>            Reporter: Gaetan
>            Priority: Minor
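
One way to check whether column pruning took effect is to list the columns named in the FileScan node of the plan text. The following is only a minimal sketch in plain Python: the helper name `scanned_columns` is hypothetical, and it assumes the plan text has the same shape as the excerpt in this issue (`FileScan <format> [col#id,...,... N more fields]`). Other Spark versions or data sources may print the node differently.

```python
import re


def scanned_columns(plan_text):
    """Return (column_names, elided_count) for the first FileScan node.

    Hypothetical helper: it assumes the node is printed as
    'FileScan <format> [col#id,col#id,... N more fields]', as in the
    excerpt from this issue.
    """
    match = re.search(r"FileScan \w+ \[([^\]]*)\]", plan_text)
    if match is None:
        return [], 0

    names = []
    elided = 0
    for item in match.group(1).split(","):
        item = item.strip()
        more = re.fullmatch(r"\.\.\. (\d+) more fields?", item)
        if more:
            # Spark truncates long column lists; keep the reported count.
            elided = int(more.group(1))
        elif item:
            # Drop the expression ID suffix, e.g. "partnerid#6794" -> "partnerid".
            names.append(item.split("#", 1)[0])
    return names, elided


# Shortened version of the plan from this issue (column list cut for brevity).
example_plan = (
    "== Physical Plan ==\n"
    "Project [partnerid#6794, monotonically_increasing_id() AS index#24939L]\n"
    "+- FileScan parquet [impression_id#6550,arbitrage_id#6551,"
    "display_timestamp#6552L,... 566 more fields]\n"
)

columns, hidden = scanned_columns(example_plan)
print(columns, hidden)
```

If pruning had worked as expected for the query in this issue, the list would contain only `partnerid` and the elided count would be 0.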