[ 
https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earle Lyons updated ARROW-16810:
--------------------------------
    Summary: [Python] PyArrow: write_dataset - Could not open CSV input source  
(was: PyArrow: write_dataset - Could not open CSV input source)

> [Python] PyArrow: write_dataset - Could not open CSV input source
> -----------------------------------------------------------------
>
>                 Key: ARROW-16810
>                 URL: https://issues.apache.org/jira/browse/ARROW-16810
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 8.0.0
>         Environment: Pop!_OS 20.04 LTS, Conda 4.11.0 / Mamba 0.23.0
>            Reporter: Earle Lyons
>            Priority: Minor
>
> Hi Arrow Community! 
> Happy Friday! I am a new user of Arrow, specifically pyarrow, and I am very 
> excited about the project. 
> I am experiencing an issue with the {*}write_dataset{*} function from the 
> {*}dataset{*} module. Please forgive me if this is a known issue; however, 
> I have searched the GitHub issues as well as Stack Overflow and have not 
> found a similar report. 
> I have a directory that contains 90 CSV files (essentially one CSV for each 
> day between 2021-01-01 and 2021-03-31). My objective was to read all the CSV 
> files into a dataset and write the dataset to a single Parquet file. 
> Unfortunately, some of the CSV files contained nulls in some columns, which 
> caused issues that were resolved by specifying DataTypes, following this 
> Stack Overflow solution:
> [How do I specify a dtype for all columns when reading a CSV file with 
> pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow]
> The following code works on the first pass.
> {code:python}
> import pyarrow as pa
> import pyarrow.csv as csv
> import pyarrow.dataset as ds
> import re
> {code}
> {code:python}
> pa.__version__
> '8.0.0'
> {code}
> {code:python}
> column_types = {}
> csv_path = '/home/user/csv_files'
> field_re_pattern = "value_*"
> # Open a dataset with the 'csv_path' path and 'csv' file format
> # and assign to 'dataset1'
> dataset1 = ds.dataset(csv_path, format='csv')
> # Loop through each field in the 'dataset1' schema,
> # match the 'field_re_pattern' regex pattern in the field name,
> # and assign 'int64' DataType to the field.name in the 'column_types'
> # dictionary 
> for field in (field for field in dataset1.schema
>               if re.match(field_re_pattern, field.name)):
>     column_types[field.name] = pa.int64()
> # Creates options for CSV data using the 'column_types' dictionary
> # This returns a <class 'pyarrow._csv.ConvertOptions'>
> convert_options = csv.ConvertOptions(column_types=column_types)
> # Creates FileFormat for CSV using the 'convert_options' 
> # This returns a <class 'pyarrow._dataset.CsvFileFormat'>
> custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)
> # Open a dataset with the 'csv_path' path, but instead of using the
> # plain 'csv' file format, use the 'custom_csv_format', and assign to
> # 'dataset2'
> dataset2 = ds.dataset(csv_path, format=custom_csv_format)
> # Write 'dataset2' to the 'csv_path' base directory in the
> # 'parquet' format, and overwrite/ignore if the file exists
> ds.write_dataset(dataset2, base_dir=csv_path, format='parquet',
>                  existing_data_behavior='overwrite_or_ignore')
> {code}
> As previously stated, on the first pass the code works and creates a single 
> Parquet file (part-0.parquet) with the correct data, row count, and schema.
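> For reference, a quick way to verify the output (a minimal sketch; the path 
> comes from the run above):
> {code:python}
> import pyarrow.parquet as pq
> # Read back the Parquet file written by write_dataset and check the
> # row count and schema against the source CSVs.
> table = pq.read_table('/home/user/csv_files/part-0.parquet')
> print(table.num_rows)
> print(table.schema)
> {code}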
> However, if the code is run again, the following error is encountered:
> {code:python}
> ArrowInvalid: Could not open CSV input source 
> '/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2: 
> Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V     p       A$A18CEBS
> 305DEM030TTW �5HZ50GCVJV1CSV
> {code}
> My interpretation of the error is that on the second pass the 'dataset2' 
> variable now includes the 'part-0.parquet' file (which can be confirmed by 
> the `dataset2.files` output listing the file), and the CSV reader is then 
> attempting to parse the Parquet file.
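> A minimal way to confirm this (sketch; same path as above):
> {code:python}
> import pyarrow.dataset as ds
> # On the second run, dataset discovery picks up every file under the
> # directory, including the Parquet output from the first run, even
> # though the dataset format is 'csv'.
> dataset2 = ds.dataset('/home/user/csv_files', format='csv')
> print([f for f in dataset2.files if f.endswith('.parquet')])
> {code}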
> If this is the case, is there an argument to ignore the Parquet file and only 
> evaluate the CSV files? Also, if a dataset object has a format of 'csv' or 
> 'pyarrow._dataset.CsvFileFormat' associated with it, it would be nice for it 
> to evaluate only CSV files rather than all file types in the path, if that is 
> not already the current behavior.
> If this is not the case, any ideas on the cause or solution?
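> In the meantime, two workarounds I am considering (sketches only, not yet 
> tested against the full data; the output directory below is hypothetical):
> {code:python}
> import glob
> import pyarrow.dataset as ds
> csv_path = '/home/user/csv_files'
> # Workaround 1: build the dataset from an explicit list of CSV paths so
> # that non-CSV files in the directory (e.g. part-0.parquet) are never
> # handed to the CSV reader.
> csv_files = sorted(glob.glob(csv_path + '/*.csv'))
> dataset2 = ds.dataset(csv_files, format='csv')
> # Workaround 2: write the Parquet output to a separate directory, so a
> # later scan of 'csv_path' only ever sees CSV files.
> ds.write_dataset(dataset2, base_dir='/home/user/parquet_out',
>                  format='parquet',
>                  existing_data_behavior='overwrite_or_ignore')
> {code}
> I also noticed that ds.dataset() has an 'exclude_invalid_files' option, 
> though I have not checked whether it applies here.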
> Any assistance would be greatly appreciated.
> Thank you and have a great day!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
