Earle Lyons updated ARROW-16810:
--------------------------------
    Summary: [Python] PyArrow: write_dataset - Could not open CSV input source  (was: PyArrow: write_dataset - Could not open CSV input source)


[Python] PyArrow: write_dataset - Could not open CSV input source
-----------------------------------------------------------------

                 Key: ARROW-16810
                 URL: https://issues.apache.org/jira/browse/ARROW-16810
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 8.0.0
         Environment: Pop!_OS 20.04 LTS with Conda 4.11.0 / Mamba 0.23.0
            Reporter: Earle Lyons
            Priority: Minor

Hi Arrow Community!

Happy Friday! I am a new Arrow user, working specifically with pyarrow, and I am very excited about the project.

I am experiencing an issue with the *write_dataset* function from the *dataset* module. Please forgive me if this is a known issue; I have searched the GitHub issues as well as Stack Overflow and have not found a similar report.

I have a directory containing 90 CSV files (essentially one CSV for each day between 2021-01-01 and 2021-03-31). My objective was to read all the CSV files into a dataset and write the dataset out as a single Parquet file. Some of the CSV files contain nulls in certain columns, which caused schema-inference problems; those were resolved by specifying the column DataTypes explicitly, following this Stack Overflow solution: [How do I specify a dtype for all columns when reading a CSV file with pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow]

The following code works on the first pass.

{code:python}
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
import re
{code}

{code:python}
pa.__version__
'8.0.0'
{code}

{code:python}
column_types = {}
csv_path = '/home/user/csv_files'
field_re_pattern = "value_*"

# Open a dataset on 'csv_path' with the 'csv' file format
dataset1 = ds.dataset(csv_path, format='csv')

# For each field in the 'dataset1' schema whose name matches the
# 'field_re_pattern' regex, record an int64 DataType for that field
# name in the 'column_types' dictionary
for field in (field for field in dataset1.schema
              if re.match(field_re_pattern, field.name)):
    column_types[field.name] = pa.int64()

# Create options for converting CSV data from the 'column_types'
# dictionary (returns a pyarrow._csv.ConvertOptions)
convert_options = csv.ConvertOptions(column_types=column_types)

# Create a FileFormat for CSV using 'convert_options'
# (returns a pyarrow._dataset.CsvFileFormat)
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)

# Open a dataset on 'csv_path' again, this time with
# 'custom_csv_format' instead of the plain 'csv' format
dataset2 = ds.dataset(csv_path, format=custom_csv_format)

# Write 'dataset2' to the 'csv_path' base directory in Parquet
# format, overwriting/ignoring existing files
ds.write_dataset(dataset2, base_dir=csv_path, format='parquet',
                 existing_data_behavior='overwrite_or_ignore')
{code}

As stated above, on the first pass the code works and creates a single Parquet file (part-0.parquet) with the correct data, row count, and schema.
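A quick way to confirm this (a sketch only, assuming pyarrow.parquet is available and reusing the path from above):

{code:python}
import pyarrow.parquet as pq

# Read back the file produced by write_dataset; 'part-0.parquet'
# is the name generated by the default basename_template
table = pq.read_table('/home/user/csv_files/part-0.parquet')
print(table.num_rows)  # total rows across the 90 input CSVs
print(table.schema)    # the 'value_*' columns should now be int64
{code}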
However, if the code is run again, the following error is encountered:

{code:python}
ArrowInvalid: Could not open CSV input source
'/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2:
Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V p A$A18CEBS
305DEM030TTW �5HZ50GCVJV1CSV
{code}

My interpretation of the error is that on the second pass 'dataset2' now includes the 'part-0.parquet' file written by the first pass (which can be confirmed from the `dataset2.files` output), and the CSV reader is attempting to parse the Parquet file.

If that is the case, is there an argument to ignore the Parquet file and evaluate only the CSV files? More generally, it would be nice if a dataset opened with the 'csv' format (or a 'pyarrow._dataset.CsvFileFormat') evaluated only CSV files rather than every file type in the path, if that is not already the current behavior. A possible workaround is sketched at the end of this report.

If that is not the case, any ideas on the cause or solution?

Any assistance would be greatly appreciated. Thank you and have a great day!
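In the meantime, two workarounds appear to avoid the problem. This is a sketch only, not verified against 8.0.0; it reuses 'custom_csv_format' from the code above, and '/home/user/parquet_out' is a hypothetical output directory:

{code:python}
import glob

import pyarrow.dataset as ds

csv_path = '/home/user/csv_files'

# Workaround 1: pass an explicit list of CSV paths, so the dataset
# never picks up 'part-0.parquet' on a later run.
csv_files = sorted(glob.glob(f'{csv_path}/*.csv'))
dataset2 = ds.dataset(csv_files, format=custom_csv_format)

# Workaround 2: write the Parquet output to a separate directory,
# so the input directory continues to contain only CSV files.
ds.write_dataset(dataset2, base_dir='/home/user/parquet_out',
                 format='parquet',
                 existing_data_behavior='overwrite_or_ignore')
{code}

pyarrow.dataset.dataset also accepts `exclude_invalid_files` and `ignore_prefixes` arguments; whether either of those would cause the Parquet file to be skipped here is untested.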