[ https://issues.apache.org/jira/browse/ARROW-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Kietzman closed ARROW-8382. ------------------------------- Resolution: Abandoned > [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition > classes > ----------------------------------------------------------------------------------- > > Key: ARROW-8382 > URL: https://issues.apache.org/jira/browse/ARROW-8382 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Reporter: Francois Saint-Jacques > Assignee: Ben Kietzman > Priority: Major > Labels: dataset > > WritePlan should look like the following. > {code:c++} > class ARROW_DS_EXPORT WritePlan { > public: > /// Execute the WritePlan and return a FileSystemDataset as a result. > Result<FileSystemDataset> Execute(FileSystemDatasetFactory factory); > protected: > /// The schema of the Dataset which will be written > std::shared_ptr<Schema> schema; > /// The format into which fragments will be written > std::shared_ptr<FileFormat> format; > > using SourceAndReader = std::pair<FIleSource, RecordBatchReader>; > /// Files to write > std::vector<SourceAndReader> outputs; > }; > {code} > * Refactor FileFormat::Write(FileSource destination, RecordBatchReader), not > sure if it should take the output schema, or the RecordBatchReader should be > already of the right schema. > * Add a class/function that constructs SourceAndReader from Fragments, > Partitioning and base path. > * Move Write() out FIleSystemDataset into WritePlan. It could take a > FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, > not a requirement. > * Simplify writing routine to avoid the PathTree directory structure, it > shouldn't be more complex than `for task in write_tasks: task()`. Not path > construction should be there. > * Move the logic of dropping columns or any filtering into a custom > RecordBatchReader. > The effects are: > * Simplified WritePlan execution, abstracted away from path construction, > and can write to multiple FileSystem and/or Buffers since it doesn't > construct the FileSource. > * By the virtue of using RecordBatchReader instead of Fragment, it isn't > tied to writing from Fragment, it can take any construct that yields a > RecordBatchReader. It also means that WritePlan doesn't have to know about > any Scan related classes. > * Writing can be done with or without partitioning, this logic is given to > whomever generates the SourceAndReader list. > * Should be simpler to test. -- This message was sent by Atlassian Jira (v8.3.4#803005)