Hi,

per https://spark.apache.org/docs/latest/cloud-integration.html, when using
S3 storage one is advised to set these options:

  spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
  spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter


However, reading the code and running a simple test suggests that
BindingParquetOutputCommitter is not used at all. Specifically, I used this
code:

  import org.apache.log4j.{Level, Logger}
  import org.apache.spark.sql.SparkSession

  // Trace the cloud commit protocol; debug-log the Hadoop output committers.
  Logger.getLogger("org.apache.spark.internal.io.cloud").setLevel(Level.TRACE)
  Logger.getLogger("org.apache.hadoop.mapreduce.lib.output").setLevel(Level.DEBUG)

  val spark = SparkSession.builder().master("local[*]")
    .config("spark.sql.sources.outputCommitterClass",
      "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .config("spark.sql.parquet.output.committer.class",
      "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .config("spark.sql.sources.commitProtocolClass",
      "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("fs.s3a.committer.magic.enabled", "true")
    .config("fs.s3.committer.magic.enabled", "true")
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3.committer.name", "magic")
    .getOrCreate()
  import spark.implicits._
  val df = Seq("foo", "bar").toDF("s")

  df.write.mode("overwrite").parquet("s3://<some-s3-bucket>/2021-09-07-parquet")

I observe that the magic committer is used, and I get trace log messages
from PathOutputCommitProtocol, but none from BindingParquetOutputCommitter.
If I remove the configuration options that set
BindingParquetOutputCommitter, the magic committer is still used.
The spark.sql.parquet.output.committer.class option is only used in
ParquetFileFormat, where it is copied to
spark.sql.sources.outputCommitterClass,
and that option, in turn, is only used by SQLHadoopMapReduceCommitProtocol,
which we don't use here.
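
To make the reasoning above concrete, here is a toy model of the option
flow as I understand it. It is not Spark's code: the object and method
names are hypothetical, and it assumes that SQLHadoopMapReduceCommitProtocol
is what you get when spark.sql.sources.commitProtocolClass is left unset.
Only the option keys and class names are taken from the description above.

```scala
// Toy model of the option flow described above. NOT Spark's implementation:
// object and method names are hypothetical, for illustration only.
object CommitterOptionFlow {
  val ParquetCommitterKey = "spark.sql.parquet.output.committer.class"
  val SourcesCommitterKey = "spark.sql.sources.outputCommitterClass"
  val ProtocolKey         = "spark.sql.sources.commitProtocolClass"

  val SqlHadoopProtocol =
    "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol"
  val PathOutputProtocol =
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
  val BindingCommitter =
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"

  // Step 1 (ParquetFileFormat, as described above): copy the
  // Parquet-specific option into the generic one, if it is set.
  def copyParquetOption(conf: Map[String, String]): Map[String, String] =
    conf.get(ParquetCommitterKey) match {
      case Some(cls) => conf + (SourcesCommitterKey -> cls)
      case None      => conf
    }

  // Step 2: in this model only SQLHadoopMapReduceCommitProtocol reads the
  // generic option; any other commit protocol ignores it.
  // Assumption: an unset protocol key means SQLHadoopMapReduceCommitProtocol.
  def committerClassConsulted(conf: Map[String, String]): Option[String] =
    if (conf.getOrElse(ProtocolKey, SqlHadoopProtocol) == SqlHadoopProtocol)
      conf.get(SourcesCommitterKey)
    else None

  def main(args: Array[String]): Unit = {
    val withPathProtocol = copyParquetOption(Map(
      ParquetCommitterKey -> BindingCommitter,
      ProtocolKey         -> PathOutputProtocol))
    val withDefaultProtocol = copyParquetOption(Map(
      ParquetCommitterKey -> BindingCommitter))

    // With PathOutputCommitProtocol the copied option is never read.
    println(committerClassConsulted(withPathProtocol))    // None
    // With the default protocol the Binding committer would be picked up.
    println(committerClassConsulted(withDefaultProtocol))
  }
}
```

If this model matches the real code, the option is set but never read
whenever PathOutputCommitProtocol is configured, which would explain the
missing trace output.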

So it seems that setting spark.sql.parquet.output.committer.class to
org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter is no
longer necessary?
Or is there some code path where it still matters?


-- 
Vladimir Prus
http://vladimirprus.com
