Re: Partition field not being utilized in query

Sandeep Sagar Mon, 28 Oct 2019 09:17:30 -0700

Yes, I had realized that and updated the code but was deploying the wrong 
version and then wondering why it was not working 😊

From: "[email protected]" <[email protected]>
Reply-To: "[email protected]" <[email protected]>, Ryan Blue 
<[email protected]>
Date: Sunday, October 27, 2019 at 2:11 PM
To: "[email protected]" <[email protected]>
Subject: Re: Partition field not being utilized in query

Thanks for the update, Sandeep. Looks like the problem in the older version was 
that your filters were getting run as string comparisons instead of timestamp 
comparisons. Spark doesn't know how to push down a filter like `cast(ts as 
string) < '...'`.

On Fri, Oct 25, 2019 at 5:53 PM Sandeep Sagar 
<[email protected]<mailto:[email protected]>> wrote:
Please ignore my email below.  The code WORKS!

I had deployed the wrong version on Spark.
Apologies for confusion.

-Sandeep

From: Sandeep Sagar 
<[email protected]<mailto:[email protected]>>
Date: Friday, October 25, 2019 at 5:37 PM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Partition field not being utilized in query

Hi
Spark SQL query seems to be doing a table scan instead of utilizing partitions 
in Iceberg.

I have created a partition using the spec as follows:

public PartitionSpec getPartitionSpec() {
PartitionSpec.Builder icebergBuilder = PartitionSpec.builderFor(getSchema());
icebergBuilder.hour(FIELD_NAME.CREATED_AT);
return icebergBuilder.build();
}
My expectation was that iceberg would implement hidden partitioning on 
CREATED_AT field which is of type Timestamp. When I look at S3, it seems to 
have created hourly partitions ( great!)

While running the Query I load the Table as follows-

dlS3Connector.getSparkSession().read().format("iceberg")
        .load(getTableLocation())          // S3 Bucket
        .where(new Column(TweetItem.FIELD_NAME.CREATED_AT).$greater$eq(new 
Timestamp(startDate))
                .and(new Column(TweetItem.FIELD_NAME.CREATED_AT).lt(new 
Timestamp(endDate)))
                .and(new Column(TweetItem.FIELD_NAME.TEXT).rlike(regExp)))
        .as(Tweet.getEncoder());

But on read it is doing a table scan as per
2019-10-26 00:04:27.731 ^[[32m[INFO ]^[[m  [main] 
o.a.s.s.e.d.v.DataSourceV2Strategy (Logging.scala:54) -
Pushing operators to class org.apache.iceberg.spark.source.IcebergSource
Pushed Filters: isnotnull(created_at#5), isnotnull(text#4)
Post-Scan Filters: (cast(created_at#5 as string) > 2019-04-01 
04:35:06.0),(cast(created_at#5 as string) < 2019-04-01 04:40:06.0),text#4 RLIKE 
hackathon|understand|Trump,isnotnull(created_at#5),isnotnull(text#4)
Output: mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, 
created_at_ms#7L

2019-10-26 00:04:27.764 ^[[32m[INFO ]^[[m  [main] o.a.i.TableScan 
(BaseTableScan.java:178) - Scanning table s3a://hackathon-hour/ snapshot 
5785804775998605063 created at 2019-10-24 09:30:26.550 with filter 
(not_null(ref(name="created_at")) and not_null(ref(name="text")))

The physical plan is displayed as -
== Physical Plan ==
*(1) Project [mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, 
lang#6, created_at_ms#7L]
+- *(1) Filter (((((cast(created_at#5 as string) > 2019-04-01 04:35:06.0) && 
(cast(created_at#5 as string) < 2019-04-01 04:40:06.0)) && text#4 RLIKE 
hackathon|understand|Trump) && isnotnull(created_at#5)) && isnotnull(text#4))
   +- *(1) ScanV2 iceberg[mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, 
created_at#5, lang#6, created_at_ms#7L] (Filters: [isnotnull(created_at#5), 
isnotnull(text#4)], Options: [path=s3a://hackathon-hour/,paths=[]])

Please let me know where I am going wrong here.

Iceberg version used is - 57b1099.dirty
Thanks
Sandeep

The information contained in this email may be confidential. It has been sent 
for the sole use of the intended recipient(s). If the reader of this email is 
not an intended recipient, you are hereby notified that any unauthorized 
review, use, disclosure, dissemination, distribution, or copying of this 
message is strictly prohibited. If you have received this email in error, 
please notify the sender immediately and destroy all copies of the message.

--
Ryan Blue
Software Engineer
Netflix

-- 
The
 information contained in this email may be confidential. It has been 

sent for the sole use of the intended recipient(s). If the
reader of this 
email is not an intended recipient, you are hereby 
notified that any 
unauthorized review, use, disclosure, dissemination, 
distribution, or 
copying of this message is strictly prohibited. If you 
have received this 
email in error, please notify
the sender immediately and destroy all copies 
of the message.

Re: Partition field not being utilized in query

Reply via email to