westonpace commented on a change in pull request #9810:
URL: https://github.com/apache/arrow/pull/9810#discussion_r612851171



##########
File path: docs/source/cpp/dataset.rst
##########
@@ -0,0 +1,389 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+
+================
+Tabular Datasets
+================
+
+.. seealso::
+   :doc:`Dataset API reference <api/dataset>`
+
+.. warning::
+
+    The ``arrow::dataset`` namespace is experimental, and a stable API
+    is not yet guaranteed.
+
+The Arrow Datasets library provides functionality to efficiently work with
+tabular, potentially larger than memory, and multi-file datasets. This includes:
+
+* A unified interface that supports different sources and file formats (currently,
+  Parquet, Feather / Arrow IPC, and CSV files) and different file systems (local,
+  cloud).
+* Discovery of sources (crawling directories, handling partitioned datasets with
+  various partitioning schemes, basic schema normalization, ..)

Review comment:
       `..` -> `...`
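
For readers following along, the source-discovery bullet in this hunk amounts to roughly the sketch below. This is an illustration rather than code from the PR: the base directory is a placeholder, and the exact headers and factory options can differ slightly between Arrow versions.

```cpp
// Minimal sketch: discover a dataset by crawling a base directory.
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>
#include <arrow/result.h>

#include <memory>
#include <string>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Result<std::shared_ptr<ds::Dataset>> DiscoverDataset(
    const std::string& base_dir) {
  auto filesystem = std::make_shared<fs::LocalFileSystem>();
  fs::FileSelector selector;
  selector.base_dir = base_dir;   // placeholder path supplied by the caller
  selector.recursive = true;      // also crawl partitioned subdirectories

  auto format = std::make_shared<ds::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                         ds::FileSystemFactoryOptions{}));
  // Finish() assembles the Dataset; no file contents are read at this point.
  return factory->Finish();
}
```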

##########
File path: cpp/src/arrow/dataset/scanner.h
##########
@@ -43,33 +43,39 @@ using RecordBatchGenerator = std::function<Future<std::shared_ptr<RecordBatch>>(
 
 namespace dataset {
 
+/// \defgroup dataset-scanning Scanning API
+///
+/// @{
+
 constexpr int64_t kDefaultBatchSize = 1 << 20;
 constexpr int32_t kDefaultBatchReadahead = 32;
 constexpr int32_t kDefaultFragmentReadahead = 8;
 
+/// Scan-specific options, which can be changed between scans of the same dataset.
 struct ARROW_DS_EXPORT ScanOptions {
-  // Filter and projection
+  /// A row filter (which can be pushed down to partitioning/reading if supported).

Review comment:
       `can` -> `will` or get rid of `if supported`.
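
As background for the wording here: the filter stored in ScanOptions is typically populated through ScannerBuilder::Filter. The sketch below is my own illustration, not code from the PR; the column name and threshold are made up, and the expression helpers assume a recent arrow::compute API (their header/namespace has moved between Arrow versions).

```cpp
#include <arrow/compute/api.h>
#include <arrow/dataset/api.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>

#include <memory>

namespace cp = arrow::compute;
namespace ds = arrow::dataset;

// Sketch: the filter expression ends up in ScanOptions::filter; formats that
// support it (e.g. Parquet row-group statistics) can skip data while reading,
// and any remaining rows are filtered after decoding.
arrow::Result<std::shared_ptr<arrow::Table>> ScanFiltered(
    const std::shared_ptr<ds::Dataset>& dataset) {
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(
      builder->Filter(cp::less(cp::field_ref("b"), cp::literal(4))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}
```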

##########
File path: docs/source/cpp/dataset.rst
##########
@@ -0,0 +1,389 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+
+================
+Tabular Datasets
+================
+
+.. seealso::
+   :doc:`Dataset API reference <api/dataset>`
+
+.. warning::
+
+    The ``arrow::dataset`` namespace is experimental, and a stable API
+    is not yet guaranteed.
+
+The Arrow Datasets library provides functionality to efficiently work with
+tabular, potentially larger than memory, and multi-file datasets. This includes:
+
+* A unified interface that supports different sources and file formats (currently,
+  Parquet, Feather / Arrow IPC, and CSV files) and different file systems (local,
+  cloud).
+* Discovery of sources (crawling directories, handling partitioned datasets with
+  various partitioning schemes, basic schema normalization, ..)
+* Optimized reading with predicate pushdown (filtering rows), projection
+  (selecting and deriving columns), and optionally parallel reading.
+
+The goal is to expand support to other file formats and data sources
+(e.g. database connections) in the future.
+
+Reading Datasets
+----------------
+
+For the examples below, let's create a small dataset consisting
+of a directory with two parquet files:
+
+.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
+   :language: cpp
+   :lines: 50-85
+   :linenos:
+   :lineno-match:
+
+(See the full example at bottom: :ref:`cpp-dataset-full-example`.)
+
+Dataset discovery
+~~~~~~~~~~~~~~~~~
+
+A :class:`arrow::dataset::Dataset` object can be created using the various
+:class:`arrow::dataset::DatasetFactory` objects. Here, we'll use the
+:class:`arrow::dataset::FileSystemDatasetFactory`, which can create a dataset
+given a base directory path:
+
+.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
+   :language: cpp
+   :lines: 151-165
+   :emphasize-lines: 6-11
+   :linenos:
+   :lineno-match:
+
+We're also passing the filesystem to use and the file format to use for reading.
+This lets us choose between (for example) reading local files or files in Amazon
+S3, or between Parquet and CSV.
+
+In addition to searching a base directory, we can list file paths manually.
+
+Creating a :class:`arrow::dataset::Dataset` does not begin reading the data
+itself. If needed, it only crawls the directory to find all the files
+(:func:`arrow::dataset::FileSystemDataset::files`):
+
+.. code-block:: cpp
+
+   // Print out the files crawled (only for FileSystemDataset)
+   for (const auto& filename : dataset->files()) {
+     std::cout << filename << std::endl;
+   }
+
+…and infers the dataset's schema (by default from the first file):
+
+.. code-block:: cpp
+
+   std::cout << dataset->schema()->ToString() << std::endl;
+
+Using the :func:`arrow::dataset::Dataset::NewScan` method, we can build a
+:class:`arrow::dataset::Scanner` and read the dataset (or a portion of it) into
+a :class:`arrow::Table` with the :func:`arrow::dataset::Scanner::ToTable`
+method:
+
+.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
+   :language: cpp
+   :lines: 151-170
+   :emphasize-lines: 16-19
+   :linenos:
+   :lineno-match:
+
+.. TODO: iterative loading not documented pending API changes
+.. note:: Depending on the size of your dataset, this can require a lot of
+          memory; see :ref:`cpp-dataset-filtering-data` below on
+          filtering/projecting.
+
+Reading different file formats
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The above examples use Parquet files on local disk but the Dataset API
+provides a consistent interface across multiple file formats and filesystems.
+(See :ref:`cpp-dataset-cloud-storage` for more information on the latter.)
+Currently, Parquet, Feather / Arrow IPC, and CSV file formats are supported;
+more formats are planned in the future.
+
+If we save the table as Feather files instead of Parquet files:
+
+.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
+   :language: cpp
+   :lines: 87-104
+   :linenos:
+   :lineno-match:
+
+…then we can read the Feather file by passing a :class:`arrow::dataset::IpcFileFormat`:

Review comment:
       `a` -> `an`
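
To make the format point concrete: switching from Parquet to Feather only changes the format object handed to the factory. A rough sketch, my own rather than the PR's, assuming a filesystem and selector set up as in the earlier discovery example:

```cpp
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>
#include <arrow/result.h>
#include <arrow/table.h>

#include <memory>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

// Sketch: the same discovery call, but with an IpcFileFormat so Feather /
// Arrow IPC files are read, followed by a full (unfiltered) scan into a Table.
arrow::Result<std::shared_ptr<arrow::Table>> ReadFeatherDataset(
    std::shared_ptr<fs::FileSystem> filesystem, fs::FileSelector selector) {
  auto format = std::make_shared<ds::IpcFileFormat>();  // instead of ParquetFileFormat
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                         ds::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
```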

##########
File path: docs/source/cpp/dataset.rst
##########
@@ -0,0 +1,389 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+
+================
+Tabular Datasets
+================
+
+.. seealso::
+   :doc:`Dataset API reference <api/dataset>`
+
+.. warning::
+
+    The ``arrow::dataset`` namespace is experimental, and a stable API
+    is not yet guaranteed.
+
+The Arrow Datasets library provides functionality to efficiently work with
+tabular, potentially larger than memory, and multi-file datasets. This includes:
+
+* A unified interface that supports different sources and file formats (currently,
+  Parquet, Feather / Arrow IPC, and CSV files) and different file systems (local,
+  cloud).
+* Discovery of sources (crawling directories, handling partitioned datasets with
+  various partitioning schemes, basic schema normalization, ..)
+* Optimized reading with predicate pushdown (filtering rows), projection
+  (selecting and deriving columns), and optionally parallel reading.
+
+The goal is to expand support to other file formats and data sources
+(e.g. database connections) in the future.
+
+Reading Datasets
+----------------
+
+For the examples below, let's create a small dataset consisting
+of a directory with two parquet files:
+
+.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
+   :language: cpp
+   :lines: 50-85
+   :linenos:
+   :lineno-match:
+
+(See the full example at bottom: :ref:`cpp-dataset-full-example`.)
+
+Dataset discovery
+~~~~~~~~~~~~~~~~~
+
+A :class:`arrow::dataset::Dataset` object can be created using the various
+:class:`arrow::dataset::DatasetFactory` objects. Here, we'll use the
+:class:`arrow::dataset::FileSystemDatasetFactory`, which can create a dataset
+given a base directory path:
+
+.. literalinclude:: ../../../cpp/examples/arrow/dataset_documentation_example.cc
+   :language: cpp
+   :lines: 151-165
+   :emphasize-lines: 6-11
+   :linenos:
+   :lineno-match:
+
+We're also passing the filesystem to use and the file format to use for reading.
+This lets us choose between (for example) reading local files or files in Amazon
+S3, or between Parquet and CSV.
+
+In addition to searching a base directory, we can list file paths manually.
+
+Creating a :class:`arrow::dataset::Dataset` does not begin reading the data
+itself. If needed, it only crawls the directory to find all the files

Review comment:
       `If needed, it only crawls...all the files` -> `It only crawls...all the files (if needed).`
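
On the lazy-crawling point itself: both calls in the quoted section touch only file paths and metadata, never row data. A small sketch combining them (illustration only; a `FileSystemDataset` is needed because `files()` is specific to that subclass):

```cpp
#include <arrow/dataset/api.h>
#include <arrow/type.h>

#include <iostream>
#include <memory>

namespace ds = arrow::dataset;

// Sketch: neither call below reads any file contents -- discovery has only
// listed the files and inferred a schema (from the first file by default).
void DescribeDataset(const std::shared_ptr<ds::FileSystemDataset>& dataset) {
  for (const auto& path : dataset->files()) {
    std::cout << path << std::endl;
  }
  std::cout << dataset->schema()->ToString() << std::endl;
}
```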




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

