I’ve been working on a format update lately to address some problems we’ve been hitting with our large tables:
- Metadata file size is very large, causing commits to take a long time - Tables with millions of files take more than a minute to plan scan tasks because so many files are in the table’s manifests - Other operations like expiring snapshots that require processing a filtered list of files take too long The first problem is caused by a large number of manifests in a large number of snapshots written into every metadata file. As the rate of commits to a table goes up, so does the number of metadata files. Even with data compaction, each commit is going to copy most of the manifest list and write it along with all the other valid snapshots. When writing every 2 minutes, this ends up being about 2,000 snapshots over 3 days, each keeping track of all the current manifests. Moving the manifest list out of the metadata file is a good way to keep this metadata small with a large number of valid snapshots. The other two problems are caused by not having enough metadata about each manifest to skip or process it, so we end up processing every manifest. For a table scan, knowing the range of values for each partition field helps eliminate whole manifests. For snapshot expiration, knowing the number of deleted files in a snapshot allows skipping manifests with no deletes. To solve both of these problems, I’ve implemented a spec change that adds a “manifest-list” in place of the “manifests” list in table metadata. The manifest list is the location of an Avro file with an entry for every manifest and all of the metadata needed to compact manifests, filter for certain operations, and filter for table scans. The details are on PR #21 <https://github.com/apache/incubator-iceberg/pull/21>. This also depends on tracking partition specs by ID, so that the spec used to write each manifest is tracked in the manifest list by its ID. That change is PR #3 <https://github.com/apache/incubator-iceberg/pull/3>. Since both of these are spec changes, I wanted to highlight them for the community. Feel free to comment on the PRs or discuss in this thread. I think the changes are straightforward, and are good to get done before the first Apache release. Thanks, rb -- Ryan Blue Software Engineer Netflix