Earle Lyons created ARROW-16810:
-----------------------------------

             Summary: PyArrow: write_dataset - Could not open CSV input source
                 Key: ARROW-16810
                 URL: https://issues.apache.org/jira/browse/ARROW-16810
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 8.0.0
         Environment: Pop!_OS 20.04 LTS, Conda 4.11.0 / Mamba 0.23.0
            Reporter: Earle Lyons


Hi Arrow Community! 

Happy Friday! I am new to Arrow, specifically PyArrow, and I am very excited 
about the project. 

I am experiencing an issue with the {*}write_dataset{*} function from the 
{*}dataset{*} module. Please forgive me if this is a known issue; I have 
searched the GitHub 'Issues', as well as Stack Overflow, and have not found a 
similar report. 

I have a directory that contains 90 CSV files (essentially one CSV for each day 
between 2021-01-01 and 2021-03-31). My objective was to read all the CSV files 
into a dataset and write the dataset out as a single Parquet file. 
Unfortunately, some of the CSV files contain nulls in some columns, which 
caused type-inference issues; these were resolved by specifying DataTypes, 
following this Stack Overflow solution:

[How do I specify a dtype for all columns when reading a CSV file with pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow]

The following code works on the first pass.
{code:python}
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
import re
{code}
{code:python}
pa.__version__
'8.0.0'
{code}
{code:python}
column_types = {}
csv_path = '/home/user/csv_files'
field_re_pattern = "value_*"

# Open a dataset with the 'csv_path' path and 'csv' file format
# and assign to 'dataset1'
dataset1 = ds.dataset(csv_path, format='csv')

# For each field in the 'dataset1' schema whose name matches the
# 'field_re_pattern' regex, record an 'int64' DataType for it in the
# 'column_types' dictionary. (Note: re.match anchors at the start of
# the string, so "value_*" matches any field name beginning with
# "value".)
for field in (field for field in dataset1.schema
              if re.match(field_re_pattern, field.name)):
    column_types[field.name] = pa.int64()

# Creates options for CSV data using the 'column_types' dictionary
# This returns a <class 'pyarrow._csv.ConvertOptions'>
convert_options = csv.ConvertOptions(column_types=column_types)

# Creates FileFormat for CSV using the 'convert_options' 
# This returns a <class 'pyarrow._dataset.CsvFileFormat'>
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)

# Open a dataset with the 'csv_path' path, but instead of the plain
# 'csv' file format, use the 'custom_csv_format' and assign to
# 'dataset2'
dataset2 = ds.dataset(csv_path, format=custom_csv_format)

# Write 'dataset2' to the 'csv_path' base directory in the
# 'parquet' format, overwriting/ignoring if the file exists
ds.write_dataset(dataset2, base_dir=csv_path, format='parquet',
                 existing_data_behavior='overwrite_or_ignore')
{code}
As previously stated, on the first pass the code works and creates a single 
Parquet file (part-0.parquet) with the correct data, row count, and schema.

However, if the code is run again, the following error is encountered:
{code:python}
ArrowInvalid: Could not open CSV input source 
'/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2: 
Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V       p       A$A18CEBS
305DEM030TTW �5HZ50GCVJV1CSV
{code}
My interpretation of the error is that, on the second pass, 'dataset2' now 
includes the 'part-0.parquet' file (which can be confirmed by the 
`dataset2.files` output showing the file), and the CSV reader is attempting to 
parse the Parquet file.
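
For reference, the second-pass file listing looks like this (file names are 
illustrative, not my actual output):
{code:python}
dataset2.files
['/home/user/csv_files/2021-01-01.csv', ...,
 '/home/user/csv_files/part-0.parquet']
{code}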

If this is the case, is there an argument to ignore the Parquet file and only 
evaluate the CSV files? Also, when a dataset object has a 'csv' or 
'pyarrow._dataset.CsvFileFormat' format associated with it, it would be nice 
if only CSV files were evaluated, rather than all file types in the path (if 
that is not already the current behavior).

If this is not the case, any ideas on the cause or solution?
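
In the meantime, here is a workaround sketch that seems to avoid the re-scan 
(lightly tested on my end; the output directory name is hypothetical): either 
write the Parquet file outside the CSV source directory, or pass an explicit 
list of CSV paths so the dataset factory never sees the Parquet output.
{code:python}
import glob

# Option 1: write the Parquet output outside the CSV source directory,
# so a second pass over 'csv_path' never picks up 'part-0.parquet'
parquet_path = '/home/user/parquet_out'  # hypothetical output directory
ds.write_dataset(dataset2, base_dir=parquet_path, format='parquet',
                 existing_data_behavior='overwrite_or_ignore')

# Option 2: build the dataset from an explicit list of CSV paths, so
# only CSV files are included regardless of what else is in the directory
csv_files = sorted(glob.glob(csv_path + '/*.csv'))
dataset2 = ds.dataset(csv_files, format=custom_csv_format)
{code}
I also noticed the 'exclude_invalid_files' argument to 'ds.dataset', which 
looks like it might skip non-CSV files when a CSV format is given, but I have 
not confirmed its behavior here.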

Any assistance would be greatly appreciated.

Thank you and have a great day!



