Partition field not being utilized in query

Sandeep Sagar Fri, 25 Oct 2019 17:38:08 -0700

Hi
Spark SQL query seems to be doing a table scan instead of utilizing partitions 
in Iceberg.


I have created a partition using the spec as follows:


public PartitionSpec getPartitionSpec() {
PartitionSpec.Builder icebergBuilder = PartitionSpec.builderFor(getSchema());
icebergBuilder.hour(FIELD_NAME.CREATED_AT);
return icebergBuilder.build();
}
My expectation was that iceberg would implement hidden partitioning on 
CREATED_AT field which is of type Timestamp. When I look at S3, it seems to 
have created hourly partitions ( great!)

While running the Query I load the Table as follows-



dlS3Connector.getSparkSession().read().format("iceberg")
        .load(getTableLocation())          // S3 Bucket
        .where(new Column(TweetItem.FIELD_NAME.CREATED_AT).$greater$eq(new 
Timestamp(startDate))
                .and(new Column(TweetItem.FIELD_NAME.CREATED_AT).lt(new 
Timestamp(endDate)))
                .and(new Column(TweetItem.FIELD_NAME.TEXT).rlike(regExp)))
        .as(Tweet.getEncoder());


But on read it is doing a table scan as per
2019-10-26 00:04:27.731 ^[[32m[INFO ]^[[m  [main] 
o.a.s.s.e.d.v.DataSourceV2Strategy (Logging.scala:54) -
Pushing operators to class org.apache.iceberg.spark.source.IcebergSource
Pushed Filters: isnotnull(created_at#5), isnotnull(text#4)
Post-Scan Filters: (cast(created_at#5 as string) > 2019-04-01 
04:35:06.0),(cast(created_at#5 as string) < 2019-04-01 04:40:06.0),text#4 RLIKE 
hackathon|understand|Trump,isnotnull(created_at#5),isnotnull(text#4)
Output: mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, 
created_at_ms#7L

2019-10-26 00:04:27.764 ^[[32m[INFO ]^[[m  [main] o.a.i.TableScan 
(BaseTableScan.java:178) - Scanning table s3a://hackathon-hour/ snapshot 
5785804775998605063 created at 2019-10-24 09:30:26.550 with filter 
(not_null(ref(name="created_at")) and not_null(ref(name="text")))


The physical plan is displayed as -
== Physical Plan ==
*(1) Project [mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, 
lang#6, created_at_ms#7L]
+- *(1) Filter (((((cast(created_at#5 as string) > 2019-04-01 04:35:06.0) && 
(cast(created_at#5 as string) < 2019-04-01 04:40:06.0)) && text#4 RLIKE 
hackathon|understand|Trump) && isnotnull(created_at#5)) && isnotnull(text#4))
   +- *(1) ScanV2 iceberg[mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, 
created_at#5, lang#6, created_at_ms#7L] (Filters: [isnotnull(created_at#5), 
isnotnull(text#4)], Options: [path=s3a://hackathon-hour/,paths=[]])

Please let me know where I am going wrong here.

Iceberg version used is - 57b1099.dirty
Thanks
Sandeep

-- 
The
 information contained in this email may be confidential. It has been 

sent for the sole use of the intended recipient(s). If the
reader of this 
email is not an intended recipient, you are hereby 
notified that any 
unauthorized review, use, disclosure, dissemination, 
distribution, or 
copying of this message is strictly prohibited. If you 
have received this 
email in error, please notify
the sender immediately and destroy all copies 
of the message.

Partition field not being utilized in query

Reply via email to