Thanks Weston!

On Tue, May 25 2021 at 17:55, Weston Pace <[email protected]> wrote:

> One minor note is that specifying compression in
> parquet::WriterProperties will result in a slightly different file than
> one created with arrow::io::CompressedOutputStream::Make. The former tells
> parquet the default compression to use for column data (you could even
> specify a per-column compression scheme if desired). It is unique to
> parquet. The latter applies compression to the entire file. It could be
> used on any output format.
>
> What you have should be fine. There is currently no way (I am aware of) to
> specify file-wide compression on dataset writes. This will probably be a
> more essential feature once CSV support (or some other format that doesn't
> natively handle compression) is added for dataset writes.
>
> On Sat, May 22, 2021 at 9:17 PM Micah Kornfield <[email protected]> wrote:
>
> > internal::checked_pointer_cast isn't really anything special. It simply
> > switches between std::static_pointer_cast<T> and
> > std::dynamic_pointer_cast<T> depending on debug/release compilation. So
> > you can choose one or the other depending on how confident you are in the
> > type you are casting.
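
To illustrate the debug/release switch Micah describes, here is a minimal standard-library sketch. It is not Arrow's source: the helper name sketch_checked_pointer_cast and the two option structs are hypothetical stand-ins, and using NDEBUG as the switch is an assumption about how "release" is detected.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-ins for a base options type and a format-specific subtype.
struct BaseOptions {
  virtual ~BaseOptions() = default;  // polymorphic, so dynamic casts are allowed
};
struct DerivedOptions : BaseOptions {
  int compression_level = 0;
};

// Illustrative helper (an assumption, not the library's implementation):
// unchecked static cast when NDEBUG is defined, checked dynamic cast otherwise.
template <typename T, typename U>
std::shared_ptr<T> sketch_checked_pointer_cast(std::shared_ptr<U> ptr) {
#ifdef NDEBUG
  return std::static_pointer_cast<T>(std::move(ptr));
#else
  // Yields an empty pointer if the object is not actually a T.
  return std::dynamic_pointer_cast<T>(std::move(ptr));
#endif
}

// Downcasts a base pointer that is known to hold a DerivedOptions, sets a
// field through the derived pointer, and returns that field.
int read_level_via_downcast() {
  std::shared_ptr<BaseOptions> base = std::make_shared<DerivedOptions>();
  auto derived = sketch_checked_pointer_cast<DerivedOptions>(base);
  derived->compression_level = 7;
  return derived->compression_level;
}
```

If the concrete type is certain, std::static_pointer_cast is enough. If it is not, std::dynamic_pointer_cast followed by a null check is the safer choice in user code.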
> >
> > On Sat, May 22, 2021 at 9:23 PM Xander Dunn <[email protected]> wrote:
> >
> > > Alright, I got it working:
> > >
> > >     parquet::WriterProperties::Builder file_writer_options_builder;
> > >     file_writer_options_builder.compression(arrow::Compression::BROTLI);
> > >     //file_writer_options_builder.compression(arrow::Compression::UNCOMPRESSED);
> > >     std::shared_ptr<parquet::WriterProperties> props =
> > >         file_writer_options_builder.build();
> > >
> > >     std::shared_ptr<ds::FileWriteOptions> file_write_options =
> > >         format->DefaultWriteOptions();
> > >     auto parquet_options =
> > >         arrow::internal::checked_pointer_cast<ds::ParquetFileWriteOptions>(
> > >             file_write_options);
> > >     parquet_options->writer_properties = props;
> > >     arrow::dataset::FileSystemDatasetWriteOptions write_options;
> > >     write_options.file_write_options = parquet_options;
> > >
> > > But surely a call to arrow::internal is not the intended usage?
> > >
> > > On Sat, May 22, 2021 at 8:52 PM Xander Dunn <[email protected]> wrote:
> > >
> > > > I see how to compress writes to a particular file using
> > > > arrow::io::CompressedOutputStream::Make, but I’m having difficulty
> > > > figuring out how to make Dataset writes compressed. I have my code set
> > > > up similar to the CreateExampleParquetHivePartitionedDataset example
> > > > here.
> > > >
> > > > I suspect there is some option on the FileSystemDatasetWriteOptions to
> > > > specify compression, but I haven’t been able to uncover it:
> > > >
> > > >     ds::FileSystemDatasetWriteOptions write_options;
> > > >     write_options.file_write_options = format->DefaultWriteOptions();
> > > >     write_options.filesystem = filesystem;
> > > >     write_options.base_dir = base_path;
> > > >     write_options.partitioning = partitioning;
> > > >     write_options.basename_template = "part{i}.parquet";
> > > >     ABORT_ON_FAILURE(ds::FileSystemDataset::Write(write_options, scanner));
> > > >
> > > > FileSystemDatasetWriteOptions is defined here and doesn’t have a
> > > > compression option.
> > > >
> > > > The file_write_options property is a ParquetFileWriteOptions, which is
> > > > defined here and has a parquet::WriterProperties and
> > > > parquet::ArrowWriterProperties. It’s created here:
> > > >
> > > >     std::shared_ptr<FileWriteOptions> ParquetFileFormat::DefaultWriteOptions() {
> > > >       std::shared_ptr<ParquetFileWriteOptions> options(
> > > >           new ParquetFileWriteOptions(shared_from_this()));
> > > >       options->writer_properties = parquet::default_writer_properties();
> > > >       options->arrow_writer_properties =
> > > >           parquet::default_arrow_writer_properties();
> > > >       return options;
> > > >     }
> > > >
> > > > parquet::WriterProperties can be created with a compression specified
> > > > like this:
> > > >
> > > >     parquet::WriterProperties::Builder file_writer_options_builder;
> > > >     file_writer_options_builder.compression(arrow::Compression::BROTLI);
> > > >     std::shared_ptr<parquet::WriterProperties> props =
> > > >         file_writer_options_builder.build();
> > > >
> > > > However, I have been unable to create a FileWriteOptions which includes
> > > > this WriterProperties. What is shared_from_this()? Creating a
> > > > FileWriteOptions with std::make_shared<> doesn’t compile. Any pointers
> > > > on creating a FileWriteOptions in my project, or a better way to
> > > > specify the compression type on a dataset write?
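
On the shared_from_this() question in the original message: it is a member function a class gets by inheriting from std::enable_shared_from_this in the standard library. The sketch below uses hypothetical types (SketchFormat, SketchWriteOptions) that only mirror the shape of the DefaultWriteOptions() snippet quoted above. It assumes nothing about how the Arrow classes are actually declared.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct SketchFormat;  // hypothetical stand-in for a file format class

// Hypothetical options object that keeps the format that created it alive.
struct SketchWriteOptions {
  explicit SketchWriteOptions(std::shared_ptr<SketchFormat> fmt)
      : format(std::move(fmt)) {}
  std::shared_ptr<SketchFormat> format;
};

// Inheriting from std::enable_shared_from_this provides shared_from_this().
struct SketchFormat : std::enable_shared_from_this<SketchFormat> {
  std::shared_ptr<SketchWriteOptions> DefaultWriteOptions() {
    // shared_from_this() returns a shared_ptr that shares ownership with the
    // shared_ptr already managing *this, so the object must already be owned
    // by a std::shared_ptr when this member function is called.
    return std::make_shared<SketchWriteOptions>(shared_from_this());
  }
};

// Builds a format, asks it for options, and reports whether the options
// point back at the same format instance.
bool options_point_back_to_format() {
  auto format = std::make_shared<SketchFormat>();
  auto options = format->DefaultWriteOptions();
  return options->format.get() == format.get();
}
```

In this sketch the options object holds a second owning reference to the format, so the format stays alive for as long as the options do.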
