Yes, I had realized that and updated the code, but I was deploying the wrong version and then wondering why it was not working 😊
From: "[email protected]" <[email protected]> Reply-To: "[email protected]" <[email protected]>, Ryan Blue <[email protected]> Date: Sunday, October 27, 2019 at 2:11 PM To: "[email protected]" <[email protected]> Subject: Re: Partition field not being utilized in query Thanks for the update, Sandeep. Looks like the problem in the older version was that your filters were getting run as string comparisons instead of timestamp comparisons. Spark doesn't know how to push down a filter like `cast(ts as string) < '...'`. On Fri, Oct 25, 2019 at 5:53 PM Sandeep Sagar <[email protected]<mailto:[email protected]>> wrote: Please ignore my email below. The code WORKS! I had deployed the wrong version on Spark. Apologies for confusion. -Sandeep From: Sandeep Sagar <[email protected]<mailto:[email protected]>> Date: Friday, October 25, 2019 at 5:37 PM To: "[email protected]<mailto:[email protected]>" <[email protected]<mailto:[email protected]>> Subject: Partition field not being utilized in query Hi Spark SQL query seems to be doing a table scan instead of utilizing partitions in Iceberg. I have created a partition using the spec as follows: public PartitionSpec getPartitionSpec() { PartitionSpec.Builder icebergBuilder = PartitionSpec.builderFor(getSchema()); icebergBuilder.hour(FIELD_NAME.CREATED_AT); return icebergBuilder.build(); } My expectation was that iceberg would implement hidden partitioning on CREATED_AT field which is of type Timestamp. When I look at S3, it seems to have created hourly partitions ( great!) While running the Query I load the Table as follows- dlS3Connector.getSparkSession().read().format("iceberg") .load(getTableLocation()) // S3 Bucket .where(new Column(TweetItem.FIELD_NAME.CREATED_AT).$greater$eq(new Timestamp(startDate)) .and(new Column(TweetItem.FIELD_NAME.CREATED_AT).lt(new Timestamp(endDate))) .and(new Column(TweetItem.FIELD_NAME.TEXT).rlike(regExp))) .as(Tweet.getEncoder()); But on read it is doing a table scan as per 2019-10-26 00:04:27.731 ^[[32m[INFO ]^[[m [main] o.a.s.s.e.d.v.DataSourceV2Strategy (Logging.scala:54) - Pushing operators to class org.apache.iceberg.spark.source.IcebergSource Pushed Filters: isnotnull(created_at#5), isnotnull(text#4) Post-Scan Filters: (cast(created_at#5 as string) > 2019-04-01 04:35:06.0),(cast(created_at#5 as string) < 2019-04-01 04:40:06.0),text#4 RLIKE hackathon|understand|Trump,isnotnull(created_at#5),isnotnull(text#4) Output: mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L 2019-10-26 00:04:27.764 ^[[32m[INFO ]^[[m [main] o.a.i.TableScan (BaseTableScan.java:178) - Scanning table s3a://hackathon-hour/ snapshot 5785804775998605063 created at 2019-10-24 09:30:26.550 with filter (not_null(ref(name="created_at")) and not_null(ref(name="text"))) The physical plan is displayed as - == Physical Plan == *(1) Project [mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L] +- *(1) Filter (((((cast(created_at#5 as string) > 2019-04-01 04:35:06.0) && (cast(created_at#5 as string) < 2019-04-01 04:40:06.0)) && text#4 RLIKE hackathon|understand|Trump) && isnotnull(created_at#5)) && isnotnull(text#4)) +- *(1) ScanV2 iceberg[mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L] (Filters: [isnotnull(created_at#5), isnotnull(text#4)], Options: [path=s3a://hackathon-hour/,paths=[]]) Please let me know where I am going wrong here. Iceberg version used is - 57b1099.dirty Thanks Sandeep The information contained in this email may be confidential. 
--
Ryan Blue
Software Engineer
Netflix
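For reference, below is a minimal sketch of one way to keep the range predicate typed as a timestamp so Spark can push it down to Iceberg and prune the hourly partitions created by the hour(created_at) spec. The thread does not show the exact fix that was deployed, so this is only an illustration under assumptions: the names spark, tableLocation, startMillis, endMillis, regExp, TimestampPushdownSketch, and the column names created_at and text are placeholders, not code from the thread.

import java.sql.Timestamp;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

public class TimestampPushdownSketch {

    // Illustrative sketch, not the fix from the thread.
    // lit(java.sql.Timestamp) produces a TimestampType literal, so the comparison
    // stays timestamp-to-timestamp and Spark can offer the range predicate to the
    // Iceberg source for push-down. A comparison that gets coerced to strings,
    // e.g. cast(created_at as string) > '...', cannot be pushed down and ends up
    // as a post-scan filter, which is what the physical plan above shows.
    public static Dataset<Row> loadTweets(SparkSession spark, String tableLocation,
                                          long startMillis, long endMillis, String regExp) {
        Column createdAt = col("created_at");
        Column inRange = createdAt.geq(lit(new Timestamp(startMillis)))
                .and(createdAt.lt(lit(new Timestamp(endMillis))));

        return spark.read().format("iceberg")
                .load(tableLocation)               // e.g. an s3a:// table location
                .where(inRange.and(col("text").rlike(regExp)));
    }
}

With a typed predicate like this, the pushed filters in the scan log should include the created_at range rather than only the isnotnull checks, and the Iceberg table scan can skip hourly partitions outside the range.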
