hyangminj opened a new pull request, #48719:
URL: https://github.com/apache/arrow/pull/48719

   # GH-48695: [Python][C++] Add max_rows parameter to CSV reader
   
   ## Summary
   
   This PR implements the `max_rows` parameter for PyArrow's CSV reader, 
addressing issue #48695. This feature is equivalent to Pandas' `nrows` 
parameter, allowing users to limit the number of rows read from a CSV file.
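
   For comparison, a minimal sketch of the intended equivalence (the file name is illustrative; `nrows` is the existing Pandas parameter, `max_rows` is the option this PR proposes):

   ```python
   import pandas as pd
   import pyarrow.csv as csv

   df = pd.read_csv("data.csv", nrows=100)      # Pandas: read at most 100 rows

   # PyArrow with this PR: the same limit expressed through ReadOptions
   table = csv.read_csv("data.csv",
                        read_options=csv.ReadOptions(max_rows=100))
   ```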
   
   ## Rationale for Changes
   
   PyArrow's CSV reader currently has no way to limit the number of rows read, 
a capability that both Pandas (`nrows`) and Polars (`n_rows`) provide. Such a 
limit is useful for:
   - Previewing large CSV files
   - Memory-constrained environments
   - Testing and development with subsets of data
   - ETL pipelines that process data in chunks
   
   ## Implementation Details
   
   ### C++ Core Changes
   
   1. **Added `max_rows` field to ReadOptions** (`cpp/src/arrow/csv/options.h`)
      - Type: `int64_t`
      - Default: `-1` (unlimited)
      - Values: `-1` for unlimited, or positive integer for exact row count
   
   2. **Validation** (`cpp/src/arrow/csv/options.cc`); the rules below are 
illustrated from the Python side in the sketch after this list
      - `max_rows = 0` → Error (invalid)
      - `max_rows < -1` → Error (invalid)
      - `max_rows = -1` → Read all rows (default)
      - `max_rows > 0` → Read exactly that many rows
   
   3. **Reader Implementations** (`cpp/src/arrow/csv/reader.cc`)
      - **ReaderMixin**: Added `rows_read_` atomic counter for thread-safe row 
tracking
      - **StreamingReaderImpl**:
        - Uses atomic counter to track rows across batches
        - Slices the final batch when only part of it is needed
        - Returns nullptr to signal end-of-stream once the limit is reached
      - **SerialTableReader**:
        - Builds complete table, then slices to exact row count
      - **AsyncThreadedTableReader**:
        - Processes blocks in parallel, then slices final table
        - Guarantees exact row count despite parallel processing
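
   A minimal Python-level sketch of how these validation rules and the default are expected to behave. Whether invalid values raise `ValueError` or `pyarrow.ArrowInvalid`, and whether that happens when the options are built or when the read starts, is an assumption here:

   ```python
   import io
   import pyarrow as pa
   import pyarrow.csv as csv

   data = b"a,b\n1,2\n3,4\n5,6\n"

   # Default (-1): behaviour unchanged, all rows are read
   assert csv.read_csv(io.BytesIO(data)).num_rows == 3

   # Positive value: at most that many rows
   opts = csv.ReadOptions(max_rows=2)
   assert csv.read_csv(io.BytesIO(data), read_options=opts).num_rows == 2

   # 0 and values below -1 are rejected by validation
   for bad in (0, -2):
       try:
           csv.read_csv(io.BytesIO(data),
                        read_options=csv.ReadOptions(max_rows=bad))
       except (ValueError, pa.ArrowInvalid):
           pass  # expected: invalid max_rows is rejected
   ```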
   
   ### Python Bindings
   
   4. **Cython Declarations** (`python/pyarrow/includes/libarrow.pxd`)
      - Added `int64_t max_rows` to CCSVReadOptions
   
   5. **Python Wrapper** (`python/pyarrow/_csv.pyx`)
      - Added `max_rows` parameter to `ReadOptions.__init__()`
      - Added property getter/setter
      - Updated `equals()` method
      - Updated pickle support (`__getstate__`, `__setstate__`)
      - Added a comprehensive docstring (usage sketched below this list)
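
   A short sketch of the new Python surface, assuming the property, `equals()`, and pickle behaviour described above (values are illustrative):

   ```python
   import pickle
   import pyarrow.csv as csv

   opts = csv.ReadOptions(max_rows=100)
   assert opts.max_rows == 100      # property getter added by this PR
   opts.max_rows = 25               # property setter

   # equals() and pickling now round-trip max_rows as well
   restored = pickle.loads(pickle.dumps(opts))
   assert restored.equals(opts)
   assert restored.max_rows == 25
   ```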
   
   ### Tests
   
   6. **Comprehensive Test Suite** (`python/pyarrow/tests/test_csv.py`)
      - `test_max_rows_basic`: Basic functionality (2 rows, 1 row, more than 
available)
      - `test_max_rows_with_skip_rows`: Interaction with `skip_rows`
      - `test_max_rows_with_skip_rows_after_names`: Interaction with 
`skip_rows_after_names`
      - `test_max_rows_edge_cases`: Validation (0, negative values)
      - `test_max_rows_with_small_blocks`: Multiple blocks with small block_size
      - `test_max_rows_multithreaded`: Exact count guarantee with 
`use_threads=True` (condensed into the sketch after this list)
      - `test_max_rows_streaming`: StreamingReader compatibility
      - `test_max_rows_pickle`: Pickle support
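
   For illustration, a condensed version of the multithreaded exact-count check. The actual tests in `test_csv.py` are more thorough; the 1000-row fixture and tiny `block_size` here are only meant to force several parallel blocks:

   ```python
   import io
   import pyarrow.csv as csv

   def test_max_rows_multithreaded_sketch():
       data = ("x\n" + "\n".join(str(i) for i in range(1000)) + "\n").encode()
       opts = csv.ReadOptions(max_rows=123, block_size=64, use_threads=True)
       table = csv.read_csv(io.BytesIO(data), read_options=opts)
       # The limit must be exact even though blocks are parsed in parallel
       assert table.num_rows == 123
   ```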
   
   ## Key Features
   
   - ✅ **Exact row count guarantee**: Returns exactly `max_rows` rows when at 
least that many are available, never an approximate cutoff
   - ✅ **Thread-safe**: Works correctly with `use_threads=True`
   - ✅ **Zero-copy slicing**: Uses `RecordBatch::Slice()` and `Table::Slice()`
   - ✅ **All reader types supported**: Serial, Streaming, and AsyncThreaded
   - ✅ **Proper error handling**: Clear validation messages
   - ✅ **Full Python integration**: Properties, pickle, equals
   
   ## Usage Examples
   
   ```python
   import pyarrow.csv as csv
   
   # Read only first 100 rows
   opts = csv.ReadOptions(max_rows=100)
   table = csv.read_csv("large_file.csv", read_options=opts)
   
   # Combine with skip_rows
   opts = csv.ReadOptions(skip_rows=5, max_rows=50)
   table = csv.read_csv("file.csv", read_options=opts)
   
   # Works with streaming reader
   reader = csv.open_csv("file.csv", read_options=opts)
   batch = reader.read_next_batch()
   ```
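
   The streaming example can be taken further: once the limit is reached the reader signals end-of-stream, so iterating over it yields `max_rows` rows in total. A sketch, assuming `file.csv` contains at least 50 data rows:

   ```python
   import pyarrow.csv as csv

   opts = csv.ReadOptions(max_rows=50)
   total = 0
   with csv.open_csv("file.csv", read_options=opts) as reader:
       for batch in reader:        # iteration stops once the limit is hit
           total += batch.num_rows
   assert total == 50
   ```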
   
   ## Backward Compatibility
   
   - ✅ Fully backward compatible
   - ✅ Default value `-1` means no behavior change for existing code
   - ✅ All existing tests pass
   
   ## Checklist
   
   - [x] Added implementation for all reader types
   - [x] Added Python bindings
   - [x] Added comprehensive tests (8 test functions)
   - [x] Updated docstrings
   - [x] Thread-safety verified
   - [x] Pickle support added
   - [x] No backward compatibility issues
   
   ## Related Issue
   
   Closes #48695
   

