Replies inline.

On Tue, May 14, 2019 at 3:21 AM Anton Okolnychyi <[email protected]> wrote:
> I would like to resume this topic. How do we see the proper API for
> migration?
>
> I have a couple of questions in mind:
> - Now, it is based on a Spark job. Do we want to keep it that way because
> the number of files might be huge? Will it be too much for the driver?

I think it is reasonable to make this a Spark job. The number of files in the
tables we convert typically requires it. This would only be too much for the
driver if all of the files are collected at one time. We commit 500,000 files
per batch, which seems to work well (sketched below). That trades away
atomicity across batches, though.

> - What should be the scope? Do we want the migration to modify the
> catalog?

What do you mean by "modify the catalog"? We've built two SQL commands using
these helper classes and methods: SNAPSHOT TABLE, which creates a new table
copy, and MIGRATE TABLE, which basically builds a snapshot but also renames
the new table into place.

> - If we stick to having Dataset[DataFiles], we will have one append per
> dataset partition. How do we want to guarantee that either all files are
> appended or none? One way is to rely on temporary tables but this requires
> looking into the catalog. Another approach is to say that if the migration
> isn’t successful, users will need to remove the metadata and retry.

I don't recommend committing from tasks. If Spark runs speculative attempts,
you can get duplicate data. Another option is to add a way to append manifest
files, not just data files. That way, each task produces a manifest and we can
create a single commit to add all of the manifests (see the second sketch
below). This requires some rewriting, but I think it would be a good way to
distribute the majority of the work and still have an atomic commit.

--
Ryan Blue
Software Engineer
Netflix
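
A minimal Scala sketch of the batched-append approach described above, written
against the public Iceberg Table API. The helper name appendInBatches, the way
the DataFile iterator is obtained, and the 500,000 default are illustrative
assumptions, not part of the thread; the point is that each commit() is atomic
on its own while the migration as a whole is not.

    import org.apache.iceberg.{DataFile, Table}

    // Stage data files in fixed-size batches and commit each batch as its own
    // Iceberg append. Every commit is atomic, but the overall migration is
    // not, which is the atomicity trade-off mentioned in the reply above.
    def appendInBatches(table: Table,
                        files: Iterator[DataFile],
                        batchSize: Int = 500000): Unit = {
      files.grouped(batchSize).foreach { batch =>
        val append = table.newAppend()          // org.apache.iceberg.AppendFiles
        batch.foreach(f => append.appendFile(f)) // stage each DataFile
        append.commit()                          // one commit per batch
      }
    }

If the job dies partway through, the table is left with some batches committed
and some not, which is why a failed migration needs cleanup and a retry.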

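And a sketch of the manifest-per-task idea from the last reply. The
appendManifest path is the piece that "requires some rewriting" in the thread,
so this is written against what that API could look like (current Iceberg has
ManifestFiles.write and AppendFiles.appendManifest); the helper names and the
manifest path handling are assumptions for illustration.

    import org.apache.iceberg.{DataFile, ManifestFile, ManifestFiles, Table}

    // Runs inside each Spark task (e.g. from Dataset[DataFile].mapPartitions):
    // write one manifest listing the task's data files and return only its
    // small ManifestFile metadata to the driver.
    def writeTaskManifest(table: Table,
                          taskFiles: Iterator[DataFile],
                          manifestPath: String): ManifestFile = {
      val writer = ManifestFiles.write(table.spec(), table.io().newOutputFile(manifestPath))
      try {
        taskFiles.foreach(f => writer.add(f))
      } finally {
        writer.close()
      }
      writer.toManifestFile()
    }

    // Runs on the driver: collect only the manifest metadata and add every
    // manifest in a single commit, so the append is all-or-nothing.
    def commitManifests(table: Table, manifests: Seq[ManifestFile]): Unit = {
      val append = table.newAppend()
      manifests.foreach(m => append.appendManifest(m))
      append.commit()
    }

Because only ManifestFile metadata comes back to the driver, the per-file work
stays distributed, and speculative task attempts are harmless since nothing is
committed until the single driver-side commit.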