Earle Lyons created ARROW-16810:
-----------------------------------

             Summary: PyArrow: write_dataset - Could not open CSV input source
                 Key: ARROW-16810
                 URL: https://issues.apache.org/jira/browse/ARROW-16810
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 8.0.0
         Environment: Pop!_OS 20.04 LTS OS & Conda 4.11.0 / Mamba 0.23.0 environment
            Reporter: Earle Lyons
Hi Arrow Community! Happy Friday!

I am a new user of Arrow, specifically pyarrow, and I am very excited about the project. I am experiencing an issue with the '{*}write_dataset{*}' function from the '{*}dataset{*}' module. Please forgive me if this is a known issue; however, I have searched the GitHub issues as well as Stack Overflow and have not found a similar report.

I have a directory that contains 90 CSV files (essentially one CSV for each day between 2021-01-01 and 2021-03-31). My objective is to read all the CSV files into a dataset and write the dataset out as a single Parquet file. Some of the CSV files contain nulls in some columns, which initially caused problems; those were resolved by specifying DataTypes, following this Stack Overflow answer: [How do I specify a dtype for all columns when reading a CSV file with pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow]

The following code works on the first pass.

{code:python}
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
import re
{code}

{code:python}
pa.__version__
'8.0.0'
{code}

{code:python}
column_types = {}
csv_path = '/home/user/csv_files'
field_re_pattern = "value_*"

# Open a dataset on the 'csv_path' path with the 'csv' file format
# and assign it to 'dataset1'
dataset1 = ds.dataset(csv_path, format='csv')

# Loop through each field in the 'dataset1' schema, match the
# 'field_re_pattern' regex pattern against the field name, and assign
# the 'int64' DataType to that field name in the 'column_types'
# dictionary
for field in (field for field in dataset1.schema
              if re.match(field_re_pattern, field.name)):
    column_types[field.name] = pa.int64()

# Create conversion options for CSV data using the 'column_types'
# dictionary; this returns a <class 'pyarrow._csv.ConvertOptions'>
convert_options = csv.ConvertOptions(column_types=column_types)

# Create a FileFormat for CSV using 'convert_options'; this returns
# a <class 'pyarrow._dataset.CsvFileFormat'>
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)

# Open a dataset on the 'csv_path' path, this time using
# 'custom_csv_format' instead of the plain 'csv' format, and assign
# it to 'dataset2'
dataset2 = ds.dataset(csv_path, format=custom_csv_format)

# Write 'dataset2' to the 'csv_path' base directory in 'parquet'
# format, overwriting/ignoring if the file already exists
ds.write_dataset(dataset2, base_dir=csv_path, format='parquet',
                 existing_data_behavior='overwrite_or_ignore')
{code}

As stated above, the code works on the first pass and creates a single Parquet file (part-0.parquet) with the correct data, row count, and schema. However, if the code is run again, the following error is raised:

{code:python}
ArrowInvalid: Could not open CSV input source '/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2: Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V p A$A18CEBS 305DEM030TTW �5HZ50GCVJV1CSV
{code}

My interpretation of the error is that on the second pass 'dataset2' now includes the 'part-0.parquet' file (which can be confirmed by the `dataset2.files` output showing the file), and the CSV reader is attempting to parse/read the Parquet file. If that is the case, is there an argument to ignore the Parquet file and only evaluate the CSV files?
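For reference, here is a minimal sketch of the workarounds I am currently considering. It reuses 'csv_path' and 'custom_csv_format' from the code above, and '/home/user/parquet_out' is just a hypothetical output directory. I have not verified that 'exclude_invalid_files' is the intended mechanism here, so please correct me if there is a better option.

{code:python}
import glob

import pyarrow.dataset as ds

csv_path = '/home/user/csv_files'
# 'custom_csv_format' is the CsvFileFormat built in the repro above

# Option 1: hand the factory an explicit list of CSV paths, so a
# stray part-0.parquet in the same directory is never inspected
csv_files = sorted(glob.glob(f'{csv_path}/*.csv'))
dataset2 = ds.dataset(csv_files, format=custom_csv_format)

# Option 2: let the factory probe every file against the CSV format
# and skip the ones that fail validation (this opens each file during
# discovery, so it can be slower)
dataset2 = ds.dataset(csv_path, format=custom_csv_format,
                      exclude_invalid_files=True)

# Option 3: write the Parquet output to a directory other than the
# CSV source directory, so a second run never re-discovers it
# ('/home/user/parquet_out' is a hypothetical path)
ds.write_dataset(dataset2, base_dir='/home/user/parquet_out',
                 format='parquet',
                 existing_data_behavior='overwrite_or_ignore')
{code}

Option 3 is probably the cleanest fix on my end, but Options 1 and 2 seem more relevant to the question of whether a CSV-formatted dataset should ignore non-CSV files.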
Also, when a dataset object has a format of 'csv' (or 'pyarrow._dataset.CsvFileFormat') associated with it, it would be nice if discovery evaluated only CSV files rather than all file types in the path, if that is not already the current behavior. If my interpretation is wrong, any ideas on the cause or a solution? Any assistance would be greatly appreciated. Thank you and have a great day!

--
This message was sent by Atlassian Jira
(v8.20.7#820007)