Thanks for the update, Sandeep. Looks like the problem in the older version was that your filters were getting run as string comparisons instead of timestamp comparisons. Spark doesn't know how to push down a filter like `cast(ts as string) < '...'`.
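To illustrate the difference, here is a minimal, hypothetical sketch (not your exact code; the column name, table location, and timestamp values are just taken from the thread) of a filter written directly against the timestamp column with timestamp-typed literals. Because both sides of the comparison are timestamps, Spark should not wrap the column in cast(created_at as string), so the predicate can be offered to the Iceberg source for pushdown and partition pruning:

    import java.sql.Timestamp;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.lit;

    public class TimestampFilterExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-timestamp-filter")
            .master("local[*]")  // for local testing only
            .getOrCreate();

        // Assumed table location from the thread; substitute your own path.
        String tableLocation = "s3a://hackathon-hour/";

        Timestamp start = Timestamp.valueOf("2019-04-01 04:35:06");
        Timestamp end = Timestamp.valueOf("2019-04-01 04:40:06");

        // Compare the timestamp column against timestamp literals directly.
        // Both sides are TimestampType, so Spark keeps the comparison as a
        // timestamp comparison instead of casting the column to string, and
        // the filter can be pushed down to the Iceberg source.
        Dataset<Row> tweets = spark.read().format("iceberg")
            .load(tableLocation)
            .where(col("created_at").geq(lit(start))
                .and(col("created_at").lt(lit(end))));

        // Check the plan: the range predicates should appear under
        // "Pushed Filters" rather than only under "Post-Scan Filters".
        tweets.explain(true);
      }
    }

With a pushed-down timestamp range, Iceberg can use the hourly partition transform on created_at to skip files instead of scanning the whole table.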
On Fri, Oct 25, 2019 at 5:53 PM Sandeep Sagar <[email protected]> wrote:

> Please ignore my email below. The code WORKS!
>
> I had deployed the wrong version on Spark.
>
> Apologies for the confusion.
>
> -Sandeep
>
> From: Sandeep Sagar <[email protected]>
> Date: Friday, October 25, 2019 at 5:37 PM
> To: "[email protected]" <[email protected]>
> Subject: Partition field not being utilized in query
>
> Hi,
>
> A Spark SQL query seems to be doing a table scan instead of utilizing
> partitions in Iceberg.
>
> I have created a partition using the spec as follows:
>
> public PartitionSpec getPartitionSpec() {
>     PartitionSpec.Builder icebergBuilder = PartitionSpec.builderFor(getSchema());
>     icebergBuilder.hour(FIELD_NAME.CREATED_AT);
>     return icebergBuilder.build();
> }
>
> My expectation was that Iceberg would implement hidden partitioning on the
> CREATED_AT field, which is of type Timestamp. When I look at S3, it seems
> to have created hourly partitions (great!).
>
> While running the query, I load the table as follows:
>
> dlS3Connector.getSparkSession().read().format("iceberg")
>     .load(getTableLocation()) // S3 bucket
>     .where(new Column(TweetItem.FIELD_NAME.CREATED_AT).$greater$eq(new Timestamp(startDate))
>         .and(new Column(TweetItem.FIELD_NAME.CREATED_AT).lt(new Timestamp(endDate)))
>         .and(new Column(TweetItem.FIELD_NAME.TEXT).rlike(regExp)))
>     .as(Tweet.getEncoder());
>
> But on read it is doing a table scan, as per:
>
> 2019-10-26 00:04:27.731 [INFO ] [main] o.a.s.s.e.d.v.DataSourceV2Strategy (Logging.scala:54) -
> Pushing operators to class org.apache.iceberg.spark.source.IcebergSource
> Pushed Filters: isnotnull(created_at#5), isnotnull(text#4)
> Post-Scan Filters: (cast(created_at#5 as string) > 2019-04-01 04:35:06.0),(cast(created_at#5 as string) < 2019-04-01 04:40:06.0),text#4 RLIKE hackathon|understand|Trump,isnotnull(created_at#5),isnotnull(text#4)
> Output: mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L
>
> 2019-10-26 00:04:27.764 [INFO ] [main] o.a.i.TableScan (BaseTableScan.java:178) - Scanning table s3a://hackathon-hour/ snapshot 5785804775998605063 created at 2019-10-24 09:30:26.550 with filter (not_null(ref(name="created_at")) and not_null(ref(name="text")))
>
> The physical plan is displayed as:
>
> == Physical Plan ==
> *(1) Project [mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L]
> +- *(1) Filter (((((cast(created_at#5 as string) > 2019-04-01 04:35:06.0) && (cast(created_at#5 as string) < 2019-04-01 04:40:06.0)) && text#4 RLIKE hackathon|understand|Trump) && isnotnull(created_at#5)) && isnotnull(text#4))
>    +- *(1) ScanV2 iceberg[mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L] (Filters: [isnotnull(created_at#5), isnotnull(text#4)], Options: [path=s3a://hackathon-hour/,paths=[]])
>
> Please let me know where I am going wrong here.
>
> The Iceberg version used is 57b1099.dirty.
>
> Thanks,
> Sandeep
--
Ryan Blue
Software Engineer
Netflix
