Replies inline.

On Tue, May 14, 2019 at 3:21 AM Anton Okolnychyi <[email protected]> wrote:
> I would like to resume this topic. How do we see the proper API for
> migration?
>
> I have a couple of questions in mind:
> - Now, it is based on a Spark job. Do we want to keep it that way because
> the number of files might be huge? Will it be too much for the driver?

I think it is reasonable to make this a Spark job. The number of files in the
tables we convert typically requires it. This would only be too much for the
driver if all of the files are collected at one time. We commit 500,000 files
per batch, which seems to work well (sketched below). That trades away
atomicity across batches, though.

> - What should be the scope? Do we want the migration to modify the
> catalog?

What do you mean by "modify the catalog"? We've built two SQL commands using
these helper classes and methods: SNAPSHOT TABLE, which creates a new table
copy, and MIGRATE TABLE, which basically builds a snapshot but also renames
the new table into place.

> - If we stick to having Dataset[DataFiles], we will have one append per
> dataset partition. How do we want to guarantee that either all files are
> appended or none? One way is to rely on temporary tables but this requires
> looking into the catalog. Another approach is to say that if the migration
> isn’t successful, users will need to remove the metadata and retry.

I don't recommend committing from tasks. If Spark runs speculative attempts,
you can get duplicate data. Another option is to add a way to append manifest
files, not just data files. That way, each task produces a manifest and we can
create a single commit to add all of the manifests (see the second sketch
below). This requires some rewriting, but I think it would be a good way to
distribute the majority of the work and still have an atomic commit.

--
Ryan Blue
Software Engineer
Netflix
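
A minimal Scala sketch of the batched-append approach described above, written
against the public Iceberg Table API. The helper name appendInBatches, the way
the DataFile iterator is obtained, and the 500,000 default are illustrative
assumptions, not part of the thread; the point is that each commit() is atomic
on its own while the migration as a whole is not.

    import org.apache.iceberg.{DataFile, Table}

    // Stage data files in fixed-size batches and commit each batch as its own
    // Iceberg append. Every commit is atomic, but the overall migration is
    // not, which is the atomicity trade-off mentioned in the reply above.
    def appendInBatches(table: Table,
                        files: Iterator[DataFile],
                        batchSize: Int = 500000): Unit = {
      files.grouped(batchSize).foreach { batch =>
        val append = table.newAppend()          // org.apache.iceberg.AppendFiles
        batch.foreach(f => append.appendFile(f)) // stage each DataFile
        append.commit()                          // one commit per batch
      }
    }

If the job dies partway through, the table is left with some batches committed
and some not, which is why a failed migration needs cleanup and a retry.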

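And a sketch of the manifest-per-task idea from the last reply. The
appendManifest path is the piece that "requires some rewriting" in the thread,
so this is written against what that API could look like (current Iceberg has
ManifestFiles.write and AppendFiles.appendManifest); the helper names and the
manifest path handling are assumptions for illustration.

    import org.apache.iceberg.{DataFile, ManifestFile, ManifestFiles, Table}

    // Runs inside each Spark task (e.g. from Dataset[DataFile].mapPartitions):
    // write one manifest listing the task's data files and return only its
    // small ManifestFile metadata to the driver.
    def writeTaskManifest(table: Table,
                          taskFiles: Iterator[DataFile],
                          manifestPath: String): ManifestFile = {
      val writer = ManifestFiles.write(table.spec(), table.io().newOutputFile(manifestPath))
      try {
        taskFiles.foreach(f => writer.add(f))
      } finally {
        writer.close()
      }
      writer.toManifestFile()
    }

    // Runs on the driver: collect only the manifest metadata and add every
    // manifest in a single commit, so the append is all-or-nothing.
    def commitManifests(table: Table, manifests: Seq[ManifestFile]): Unit = {
      val append = table.newAppend()
      manifests.foreach(m => append.appendManifest(m))
      append.commit()
    }

Because only ManifestFile metadata comes back to the driver, the per-file work
stays distributed, and speculative task attempts are harmless since nothing is
committed until the single driver-side commit.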