Hi
A Spark SQL query seems to be doing a full table scan instead of using the
partitions in Iceberg.
I created the partition spec as follows:
    public PartitionSpec getPartitionSpec() {
        PartitionSpec.Builder icebergBuilder = PartitionSpec.builderFor(getSchema());
        icebergBuilder.hour(FIELD_NAME.CREATED_AT);
        return icebergBuilder.build();
    }
My expectation was that Iceberg would apply hidden partitioning on the
CREATED_AT field, which is of type Timestamp. When I look at S3, it does seem
to have created hourly partitions (great!).
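For reference, this is how I understand the hour transform: each timestamp maps to the number of whole hours since 1970-01-01 00:00:00 UTC, so a narrow time range should touch only one or two partitions. The sketch below is only an illustration of that idea, not Iceberg's code; the class and method names are hypothetical, and it assumes the timestamps are interpreted in UTC.

```java
import java.time.Instant;

public class HourTransformSketch {

    // Hypothetical helper: whole hours elapsed since the Unix epoch (UTC).
    // floorDiv keeps pre-1970 timestamps in the correct (earlier) hour.
    public static long hourOrdinal(Instant ts) {
        return Math.floorDiv(ts.getEpochSecond(), 3600L);
    }

    public static void main(String[] args) {
        // The two bounds from my filter, assumed here to be UTC.
        Instant start = Instant.parse("2019-04-01T04:35:06Z");
        Instant end = Instant.parse("2019-04-01T04:40:06Z");

        // Both bounds fall in the same hour, so I would expect a single
        // hourly partition to be read for this range.
        System.out.println("start hour: " + hourOrdinal(start));
        System.out.println("end hour:   " + hourOrdinal(end));
        System.out.println("same partition: " + (hourOrdinal(start) == hourOrdinal(end)));
    }
}
```

If that understanding is right, the five-minute range in my query should prune down to one hourly partition.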
When running the query, I load the table as follows:
    dlS3Connector.getSparkSession().read().format("iceberg")
        .load(getTableLocation()) // S3 bucket
        .where(new Column(TweetItem.FIELD_NAME.CREATED_AT).$greater$eq(new Timestamp(startDate))
            .and(new Column(TweetItem.FIELD_NAME.CREATED_AT).lt(new Timestamp(endDate)))
            .and(new Column(TweetItem.FIELD_NAME.TEXT).rlike(regExp)))
        .as(Tweet.getEncoder());
But on read it does a full table scan, according to the log:
2019-10-26 00:04:27.731 [INFO ] [main] o.a.s.s.e.d.v.DataSourceV2Strategy (Logging.scala:54) -
Pushing operators to class org.apache.iceberg.spark.source.IcebergSource
Pushed Filters: isnotnull(created_at#5), isnotnull(text#4)
Post-Scan Filters: (cast(created_at#5 as string) > 2019-04-01 04:35:06.0),(cast(created_at#5 as string) < 2019-04-01 04:40:06.0),text#4 RLIKE hackathon|understand|Trump,isnotnull(created_at#5),isnotnull(text#4)
Output: mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L
2019-10-26 00:04:27.764 [INFO ] [main] o.a.i.TableScan (BaseTableScan.java:178) - Scanning table s3a://hackathon-hour/ snapshot 5785804775998605063 created at 2019-10-24 09:30:26.550 with filter (not_null(ref(name="created_at")) and not_null(ref(name="text")))
The physical plan is displayed as:
== Physical Plan ==
*(1) Project [mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L]
+- *(1) Filter (((((cast(created_at#5 as string) > 2019-04-01 04:35:06.0) && (cast(created_at#5 as string) < 2019-04-01 04:40:06.0)) && text#4 RLIKE hackathon|understand|Trump) && isnotnull(created_at#5)) && isnotnull(text#4))
   +- *(1) ScanV2 iceberg[mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L] (Filters: [isnotnull(created_at#5), isnotnull(text#4)], Options: [path=s3a://hackathon-hour/,paths=[]])
Please let me know where I am going wrong here.
The Iceberg version used is 57b1099.dirty.
Thanks
Sandeep