[jira] [Commented] (ARROW-18140) The metadata info will lost in parquet file schema after writing the parquet file by calling the FileSystemDataset::Write() method.

Weston Pace (Jira) Mon, 24 Oct 2022 11:18:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-18140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623326#comment-17623326
 ]


Weston Pace commented on ARROW-18140:
-------------------------------------

This could definitely be improved.  The write node, in Acero, takes a single 
{{std::shared_ptr<const KeyValueMetadata> custom_metadata}}, which is attached 
to all written files.  At the moment the FileSystemDataset::Write method uses 
metadata from the dataset's projected schema as input to the write node for 
this field:

{noformat}
  // The projected_schema is currently used by pyarrow to preserve the custom
  // metadata when reading from a single input file.
  const auto& custom_metadata =
      scanner->options()->projected_schema->metadata();

  RETURN_NOT_OK(
      compute::Declaration::Sequence(
          {
              {"scan", ScanNodeOptions{dataset, scanner->options()}},
              {"filter",
               compute::FilterNodeOptions{scanner->options()->filter}},
              {"project",
               compute::ProjectNodeOptions{std::move(exprs), std::move(names)}},
              {"write", WriteNodeOptions{write_options, custom_metadata}},
          })
          .AddToPlan(plan.get()));
{noformat}

This is not very user friendly.  It works this way only because migration from 
the old capabilities has been slow, and because this happens to be how pyarrow 
invokes the datasets API.  I think it would be possible to use this today, but 
you would have to create the scan options without the ScannerBuilder, because 
the ScannerBuilder doesn't allow you to set the projected schema directly.

That being said, it should be fairly simple to add a "custom_metadata" argument 
to {{FileSystemDataset::Write}}.  If it isn't null, we should use it instead of 
the projected schema's metadata (and probably even migrate pyarrow to using 
this call too).
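
To make the proposed fallback concrete, here is a minimal sketch of the 
selection logic only.  It uses stand-in types instead of the real Arrow classes 
so that it compiles on its own, and the helper name ResolveWriteMetadata is 
hypothetical; the actual change would presumably live inside 
FileSystemDataset::Write and may look different.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for arrow::KeyValueMetadata and arrow::Schema.
// They exist only so the sketch compiles without Arrow.
using KeyValueMetadata = std::map<std::string, std::string>;

struct Schema {
  std::shared_ptr<const KeyValueMetadata> metadata;
};

// Chooses the metadata that would be passed to the write node.
// An explicit, non-null custom_metadata wins; otherwise fall back to the
// projected schema's metadata, which is the current behavior.
std::shared_ptr<const KeyValueMetadata> ResolveWriteMetadata(
    const std::shared_ptr<const KeyValueMetadata>& custom_metadata,
    const std::shared_ptr<Schema>& projected_schema) {
  if (custom_metadata != nullptr) {
    return custom_metadata;
  }
  assert(projected_schema != nullptr);
  return projected_schema->metadata;
}
```

With this shape, existing callers that pass no custom metadata keep today's 
behavior, while a caller with a projection (as in the reproduction below) could 
pass the original dataset schema's metadata explicitly.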

> The metadata info will lost in parquet file schema after writing the parquet 
> file by calling the FileSystemDataset::Write() method.
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-18140
>                 URL: https://issues.apache.org/jira/browse/ARROW-18140
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Ke Jia
>            Priority: Major
>
> This issue can be reproduced by the following code.
> auto format = std::make_shared<ParquetFileFormat>();
> auto fs = std::make_shared<fs::internal::MockFileSystem>(fs::kNoTime);
> FileSystemDatasetWriteOptions write_options;
> write_options.file_write_options = format->DefaultWriteOptions();
> write_options.filesystem = fs;
> write_options.base_dir = "root";
> write_options.partitioning = std::make_shared<HivePartitioning>(schema({}));
> write_options.basename_template = "{i}.parquet";
> auto metadata =
>     std::shared_ptr<KeyValueMetadata>(new KeyValueMetadata({"foo"}, {"bar"}));
> auto dataset_schema = schema({field("a", int64())}, metadata);
> RecordBatchVector batches{
>     ConstantArrayGenerator::Zeroes(kRowsPerBatch, dataset_schema)};
> ASSERT_EQ(0, batches[0]->column(0)->null_count());
> auto dataset = std::make_shared<InMemoryDataset>(dataset_schema, batches);
> ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
> ASSERT_OK(scanner_builder->Project(
>     {compute::call("add", {compute::field_ref("a"), compute::literal(1)})},
>     {"a_plus_one"}));
> ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
> // Before the write, the schema has the metadata.
> ASSERT_EQ(1, dataset_schema->HasMetadata());
> ASSERT_OK(FileSystemDataset::Write(write_options, scanner));
> ASSERT_OK_AND_ASSIGN(auto dataset_factory,
>                      FileSystemDatasetFactory::Make(fs, {"root/0.parquet"},
>                                                     format, {}));
> ASSERT_OK_AND_ASSIGN(auto written_dataset,
>                      dataset_factory->Finish(FinishOptions{}));
> // After the write, the schema no longer has the metadata.
> ASSERT_EQ(0, written_dataset->schema()->HasMetadata());



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
