> Oh but s3Guard will not solve the atomicity problem, right?
The basic consistency needed for ACID is list-after-delete and
list-after-create (which S3 does not have).
S3Guard does solve the atomicity problem, because compactors don't just rename
directories.
They also place a file named '_orc_acid_version' in the directory.
This happens after rename() returns.
    fs.rename(fileStatus.getPath(), newPath);
    AcidUtils.OrcAcidVersion.writeVersionFile(newPath, fs);
With S3Guard, all a reader needs to do is check for that file (if it is
missing, the directory is not a completely compacted dir yet).
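A rough sketch of that reader-side check, purely for illustration. It uses
java.nio.file on a local directory as a stand-in for the Hadoop FileSystem
API, and the class and method names (CompactedDirCheck,
isCompleteCompactedDir) are made up here; this is not the actual Hive code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CompactedDirCheck {
    // Marker file name taken from the message above.
    static final String VERSION_FILE = "_orc_acid_version";

    // Hypothetical helper: treat a directory as fully compacted only if the
    // marker file, which is written after rename() returns, is present.
    static boolean isCompleteCompactedDir(Path dir) {
        return Files.isRegularFile(dir.resolve(VERSION_FILE));
    }

    public static void main(String[] args) throws IOException {
        // Local temp directory standing in for a freshly renamed compacted dir.
        Path dir = Files.createTempDirectory("base_0000010");
        System.out.println(isCompleteCompactedDir(dir)); // marker not written yet
        Files.createFile(dir.resolve(VERSION_FILE));
        System.out.println(isCompleteCompactedDir(dir)); // marker now present
    }
}
```

The point is only that the marker is the last thing written, so a reader that
sees it can assume the rename before it has finished.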
However, the "open a txn for compact & commit it" is definitely neater.
> So that means that the directory will be "visible while in progress", and
> the reader might pick up the compacted directory even when all files
> haven't been copied.
In another thread today, I mentioned that ACID is built on top of ignoring
directories, so it can do that easily.
The Parquet or Avro transactional system in Hive boils down to a PathFilter
with some numbers in the path.
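To make the "PathFilter with some numbers in the path" idea concrete, here is
a simplified sketch. It assumes directory names of the form
delta_<minTxn>_<maxTxn> and a single reader high watermark, and it uses a
plain java.util.function.Predicate rather than Hadoop's PathFilter; the class
and method names are hypothetical, and real snapshot handling is more involved
than this.

```java
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TxnPathFilter {
    // Assumed naming scheme for this sketch: delta_<minTxn>_<maxTxn>
    private static final Pattern DELTA = Pattern.compile("delta_(\\d+)_(\\d+)");

    // Accept a directory name only if it matches the scheme and its highest
    // transaction id is at or below the reader's high watermark.
    static Predicate<String> visibleUpTo(long highWatermark) {
        return dirName -> {
            Matcher m = DELTA.matcher(dirName);
            return m.matches() && Long.parseLong(m.group(2)) <= highWatermark;
        };
    }

    public static void main(String[] args) {
        Predicate<String> filter = visibleUpTo(5);
        System.out.println(filter.test("delta_0000001_0000005")); // true
        System.out.println(filter.test("delta_0000006_0000006")); // false
        System.out.println(filter.test("_tmp_compaction"));       // false
    }
}
```

Anything the filter rejects is simply never listed to the reader, which is
what "ignoring directories" amounts to.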
Cheers,
Gopal