Hey everyone,

Here are my notes from the last sync. Feel free to add/correct.

Conferences

There are three talks on Iceberg at the Dremio conference.
- "The Future of Intelligent Storage in Big Data" by Dan- "Hiveberg: 
Integrating Apache Iceberg with the Hive Metastore" by Adrian and Christine
- "Lessons learned from running Apache Iceberg at PB scale" by Anton

Hive integration

Adrien: Found a bug when an MR job is launched in distributed mode. @guilload 
and @rdsr are taking a look at it and will propose a fix soon.
Adrien: It is hard to work with large tables as predicate push-down is not 
working. Waiting for a PR from @cmathiesen and @massdosage.

Flink integration

Junjie: There is some progress on the Flink sink, and the work is split into 
smaller PRs that are getting merged into master.
Kyle: I’d be interested to review.

Row-level deletes

Anton: Most of the work for core metadata is done. We have delete manifests, 
sequence numbers, updated manifest lists.
Junjie: There is progress on readers to project metadata columns like row 
position in Avro, Parquet, ORC.
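
For readers new to this: a minimal sketch of what projecting a row position 
column means, independent of the file format. Each record read from a data file 
is paired with its ordinal position in that file, which is what position-based 
deletes refer to later. Names below are illustrative, not the actual reader code.

  import java.util.Iterator;

  // Minimal illustration of a row position metadata column: each record from a
  // data file is paired with its 0-based ordinal position in that file.
  // Illustrative sketch only, not the actual reader code.
  final class PositionedReader<T> implements Iterator<PositionedReader.Positioned<T>> {
    static final class Positioned<R> {
      final long pos;   // row position within the data file
      final R record;
      Positioned(long pos, R record) { this.pos = pos; this.record = record; }
    }

    private final Iterator<T> rows;
    private long nextPos = 0;

    PositionedReader(Iterator<T> rows) { this.rows = rows; }

    @Override public boolean hasNext() { return rows.hasNext(); }

    @Override public Positioned<T> next() { return new Positioned<>(nextPos++, rows.next()); }
  }
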
Anton: I was supposed to start working on a two-phase job planning approach but 
was distracted by other things. I plan to resume looking into that.
Anton: It seems like the points raised by @openinx in the CDC pipelines doc 
must be resolved before moving on with any implementation.

Could not get more details as neither Ryan nor Zheng was present.

CDC open questions: 
https://docs.google.com/document/d/1bBKDD4l-pQFXaMb4nOyVK-Sl3N2NTTG37uOCQx8rKVc 

SQL extensions

Anton: Thanks everyone for the feedback. It looks like we almost have consensus 
on how it should look. There is one open question raised by Carl.
Carl: How will the currently proposed approach that relies on stored procedures 
work with role-based access control? Presto has support for this.
Anton: We can limit access to stored procedures, but I don’t know how we can 
limit calling a stored procedure on a particular table if the table name is 
passed as an argument.
Carl: It feels easier with ALTER TABLE syntax.
Carl: It is better to follow up with the Presto community on this.
Anton: Agreed. It is a blocker to move forward.
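
To make the open question concrete, here is a rough sketch of the two syntax 
styles being compared. The procedure name, its arguments, and the ALTER TABLE 
clause below are placeholders loosely based on the proposal, not finalized syntax.

  // Two syntax styles for the same maintenance operation, written out to make
  // the RBAC question concrete. The procedure name and the ALTER TABLE clause
  // are hypothetical placeholders, not finalized syntax.
  public class SqlExtensionStyles {
    public static void main(String[] args) {
      // Stored-procedure style: the target table arrives as a string argument,
      // so an authorizer that keys on table identifiers in the statement cannot
      // easily tell which table the call touches.
      String procedureStyle =
          "CALL catalog.system.rollback_to_snapshot('db.events', 123456789)";

      // ALTER TABLE style: the table is a real identifier in the statement,
      // which is what role-based access control in engines like Presto checks.
      String alterTableStyle =
          "ALTER TABLE db.events ROLLBACK TO SNAPSHOT 123456789";  // hypothetical clause

      System.out.println(procedureStyle);
      System.out.println(alterTableStyle);
    }
  }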

Dev list discussion: 
https://lists.apache.org/thread.html/rb3321727198d65246ec9eb0f938b121ec6fe5dd0face0b2fb899996a%40%3Cdev.iceberg.apache.org%3E

SQL extensions doc: 
https://docs.google.com/document/d/1Nf8c16R2hj4lSc-4sQg4oiUUV_F4XqZKth1woEo6TN8 

Vectorized reads for Parquet

Anton: Cannot use vectorized reads for tables with identity partitions.
Russell: Working on a fix.

ExpireSnapshotsAction

Russell: Working on an action for expiring snapshots, as the current solution 
is slow when a large number of snapshots needs to be expired. The work is split 
into multiple PRs that are being merged to master.
Ratandeep: We also face this problem. We will help to review.
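
For context, the path that is slow today is the expire call on the core table 
API, which runs in a single process; a rough sketch (retention values are 
made-up examples):

  import java.util.concurrent.TimeUnit;
  import org.apache.iceberg.Table;

  // Sketch of expiring snapshots through the core table API; this is the path
  // that can be slow when many snapshots are expired and that the new Spark
  // action aims to speed up. Retention values are made-up examples.
  public class ExpireOldSnapshots {
    static void expire(Table table) {
      long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
      table.expireSnapshots()
          .expireOlderThan(cutoff)   // drop snapshots older than a week
          .retainLast(10)            // but always keep the 10 most recent
          .commit();                 // metadata update plus local file cleanup
    }
  }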

Secondary indexes

Miao: Working on a doc for secondary indexes in Iceberg. The solution should be 
able to support multiple index implementations and should be independent from 
file formats.
Miao: We have a Bloom filter implementation internally.
Anton: Do you keep a Bloom filter per file?
Miao: Yes.
Anton: Do you store it separately so it can be loaded on demand?
Miao: Yes.
Anton: Bloom filters are too big to be incorporated into the metadata directly, 
but it would be great to be able to load some of them on demand and use them 
during job planning. One of the use cases we were looking to solve is speeding 
up queries with predicates on the sort key. Right now, we do min/max file 
pruning, but if you have 10-20 possible keys you are looking for (and they 
cover the full range of values), filtering is not very effective. We want to 
leverage Bloom filters for this task and avoid touching data files completely. 
Right now, we still have to read a dictionary before we can discard a file. 
That takes about a second per false-positive data file, since query engines 
have to spin up a task for it.
Kyle: We have use cases where people are looking for way more than 20 keys.
Anton: We may need a very large Bloom filter to get an acceptable 
false-positive ratio when looking for a large number of keys at the same time. 
Supporting 10-20 keys would be a great start, at least.
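
As a rough illustration of that sizing trade-off, here are the standard Bloom 
filter formulas with made-up element counts and target rates (not numbers from 
anyone's tables):

  // Back-of-the-envelope Bloom filter sizing: how big a per-file filter needs
  // to be for a target false-positive rate, and how the chance of scheduling a
  // useless read grows when a query probes many keys at once.
  public class BloomSizing {
    // bits needed for n distinct values at per-lookup false-positive rate p
    static long bitsNeeded(long n, double p) {
      return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // optimal number of hash functions for a filter of m bits over n values
    static int hashCount(long n, long m) {
      return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // probability that at least one of q probed keys is a false positive,
    // i.e. the file gets a task spun up for nothing
    static double anyFalsePositive(double p, int q) {
      return 1.0 - Math.pow(1.0 - p, q);
    }

    public static void main(String[] args) {
      long n = 10_000_000L;  // assumed distinct sort-key values per data file
      double p = 0.01;       // target per-lookup false-positive rate
      long m = bitsNeeded(n, p);
      System.out.printf("%d values at p=%.2f -> %.1f MiB, %d hash functions%n",
          n, p, m / 8.0 / 1024.0 / 1024.0, hashCount(n, m));
      // a handful of probed keys keeps the filter effective...
      System.out.printf("20 keys probed: P(file kept by mistake) = %.3f%n",
          anyFalsePositive(p, 20));
      // ...but probing thousands of keys needs a much lower p, i.e. a bigger filter
      System.out.printf("5000 keys probed: P(file kept by mistake) = %.3f%n",
          anyFalsePositive(p, 5000));
    }
  }
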
Xinli: It would be great to leverage existing Bloom filters for file formats 
that support them (e.g. Parquet, ORC).
Anton: Can we derive a Bloom filter per file based on Bloom filters for row 
groups?
Miao: Depends on Bloom filter implementation.
Anton: Is it better to keep a list of Bloom filters per row group or one per 
file in the metadata?
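
A minimal sketch of why the answer depends on the implementation: per-row-group 
filters can be collapsed into a per-file filter only when they share the same 
bit length and hash functions, in which case the merge is a bitwise OR; the 
merged filter is denser, so its false-positive rate is higher. Class and field 
names below are illustrative, not Iceberg or Parquet code.

  import java.util.List;

  // If every row-group filter uses the same bit length and hash functions, a
  // per-file filter is the bitwise OR of their bit arrays; otherwise a simple
  // merge is not possible. Illustrative sketch only.
  final class SimpleBloomFilter {
    final long[] bits;          // filter bit array, packed into longs
    final int numHashFunctions; // must match across filters to allow merging

    SimpleBloomFilter(long[] bits, int numHashFunctions) {
      this.bits = bits;
      this.numHashFunctions = numHashFunctions;
    }

    static SimpleBloomFilter union(List<SimpleBloomFilter> rowGroupFilters) {
      SimpleBloomFilter first = rowGroupFilters.get(0);
      long[] merged = first.bits.clone();
      for (SimpleBloomFilter f : rowGroupFilters.subList(1, rowGroupFilters.size())) {
        if (f.bits.length != merged.length || f.numHashFunctions != first.numHashFunctions) {
          throw new IllegalArgumentException("row-group filters are not mergeable");
        }
        for (int i = 0; i < merged.length; i++) {
          merged[i] |= f.bits[i];  // a value present in any row group is present in the file
        }
      }
      return new SimpleBloomFilter(merged, first.numHashFunctions);
    }
  }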

Sort spec

Anton: I have submitted a proposal for Spark that should allow data sources to 
request a specific distribution and ordering on write. Iceberg SortSpec should 
be based on that. Any feedback on that proposal would help.

Spark proposal: 
https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs 
Spark PR: https://github.com/apache/spark/pull/29066 
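
For those who have not read the doc: the rough idea is that a data source 
declares the clustering and sort order it wants, and Spark plans the 
shuffle/sort before handing rows to the writer. The interface below is only a 
placeholder sketch of that idea; the real API shape is defined in the proposal 
and PR above.

  // Placeholder sketch of the idea only; the real interface, names, and types
  // are defined in the Spark proposal and PR linked above.
  interface RequestsDistributionAndOrdering {
    // e.g. cluster incoming rows by the table's partition columns so each
    // write task produces few files
    String[] requiredClustering();

    // e.g. sort rows within each task by the table's sort key before writing
    String[] requiredOrdering();
  }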

Data compaction

Anton: I am going to submit a new proposal that will be a follow-up to the SQL 
extensions doc. It should cover sort-based data compaction in addition to 
bin-packing.


Thanks,
Anton
