I see how to compress writes to a particular file using
arrow::io::CompressedOutputStream::Make, but I’m having difficulty figuring
out how to make Dataset writes compressed. I have my code set up similar to
the CreateExampleParquetHivePartitionedDataset example here
<https://github.com/apache/arrow/blob/master/cpp/examples/arrow/dataset_documentation_example.cc#L113>.


I suspect there is some option on the FileSystemDatasetWriteOptions to
specify compression, but I haven’t been able to uncover it:

    ds::FileSystemDatasetWriteOptions write_options;
    write_options.file_write_options = format->DefaultWriteOptions();
    write_options.filesystem = filesystem;
    write_options.base_dir = base_path;
    write_options.partitioning = partitioning;
    write_options.basename_template = "part{i}.parquet";
    ABORT_ON_FAILURE(ds::FileSystemDataset::Write(write_options, scanner));

FileSystemDatasetWriteOptions is defined here
<https://github.com/apache/arrow/blob/602a76ac58bc8de60a353648f02cf11891563e77/cpp/src/arrow/dataset/file_base.h#L331>
and doesn’t have a compression option.

The file_write_options property is a ParquetFileWriteOptions, which is
defined here
<https://github.com/apache/arrow/blob/8b4942728e7347dc921a2d423e996fea5f9e2102/cpp/src/arrow/dataset/file_parquet.h#L222>
and has a parquet::WriterProperties and parquet::ArrowWriterProperties.
It’s created here:

std::shared_ptr<FileWriteOptions> ParquetFileFormat::DefaultWriteOptions() {
  std::shared_ptr<ParquetFileWriteOptions> options(
      new ParquetFileWriteOptions(shared_from_this()));
  options->writer_properties = parquet::default_writer_properties();
  options->arrow_writer_properties = parquet::default_arrow_writer_properties();
  return options;
}

parquet::WriterProperties can be created with a compression specified like
this:

    parquet::WriterProperties::Builder file_writer_options_builder;
    file_writer_options_builder.compression(arrow::Compression::BROTLI);
    std::shared_ptr<parquet::WriterProperties> props =
        file_writer_options_builder.build();

However, I have been unable to create a FileWriteOptions which includes
this WriterProperties. What is shared_from_this()? Creating a
FileWriteOptions with std::make_shared<> doesn’t compile. Any pointers on
creating a FileWriteOptions in my project, or a better way to specify the
compression type on a dataset write?
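
To make the shared_from_this() question concrete, here is a minimal, Arrow-free sketch of the pattern as I understand it. Every name in it (MockFormat, MockWriteOptions, MockParquetWriteOptions, MakeCompressedOptions) is a hypothetical stand-in, not an Arrow class, and the string field only stands in for writer_properties. It assumes the options constructor is non-public and reachable only from the format class, which would explain why std::make_shared<> fails to compile, and it shows the factory result being downcast and modified instead of constructing the options directly.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

class MockFormat;  // hypothetical stand-in for a file format class

// Hypothetical stand-in for a generic write-options base class.
class MockWriteOptions {
 public:
  virtual ~MockWriteOptions() = default;
  std::shared_ptr<MockFormat> format;  // back-pointer to the creating format

 protected:
  explicit MockWriteOptions(std::shared_ptr<MockFormat> f)
      : format(std::move(f)) {}
};

// Hypothetical stand-in for format-specific options.
class MockParquetWriteOptions : public MockWriteOptions {
 public:
  std::string compression = "UNCOMPRESSED";  // stands in for writer_properties

 private:
  // Only the format may construct options, so std::make_shared<> from
  // user code cannot reach this constructor.
  friend class MockFormat;
  explicit MockParquetWriteOptions(std::shared_ptr<MockFormat> f)
      : MockWriteOptions(std::move(f)) {}
};

class MockFormat : public std::enable_shared_from_this<MockFormat> {
 public:
  std::shared_ptr<MockWriteOptions> DefaultWriteOptions() {
    // shared_from_this() yields a shared_ptr that shares ownership with the
    // shared_ptr already managing *this, so the options can keep the format
    // alive. The format must already be owned by a shared_ptr.
    std::shared_ptr<MockParquetWriteOptions> options(
        new MockParquetWriteOptions(shared_from_this()));
    return options;
  }
};

// Take the defaults from the factory, downcast, then override one setting.
std::shared_ptr<MockWriteOptions> MakeCompressedOptions(
    const std::shared_ptr<MockFormat>& format, const std::string& codec) {
  auto options = std::dynamic_pointer_cast<MockParquetWriteOptions>(
      format->DefaultWriteOptions());
  assert(options != nullptr);  // the mock factory always returns this type
  options->compression = codec;
  return options;
}
```

If the real classes behave the same way, which this sketch does not verify, the equivalent would be to cast the result of format->DefaultWriteOptions() to ParquetFileWriteOptions and assign props to its writer_properties before storing it in write_options.file_write_options.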
