[jira] [Commented] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source

2022-06-16 Thread Earle Lyons (Jira)


[ https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555275#comment-17555275 ]

Earle Lyons commented on ARROW-16810:
-------------------------------------

Hi [~westonpace]! 

Good day to you! Thanks so much for the response and very helpful information. 
If I recall correctly, I tried writing the output to a new subdirectory (e.g. 
'/home/user/csv_files/pq_files') and the Parquet file was still discovered, but I did 
not try a separate directory outside the CSV path (e.g. '/home/user/data/pq_files'). 

I agree, passing a list of files is probably the best method given the 
available options. To your point, a list also provides the flexibility to 
include or exclude specific files.

In the future, it would be wonderful if paths with wildcards and supported 
format extensions (e.g. '/*.csv') could be handled by the dataset.dataset 
'source' parameter.
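
For reference, here is a rough sketch of the list-of-files approach I mean (the 
glob pattern is illustrative, and the output directory follows the "new 
directory" idea above):
{code:python}
import glob

import pyarrow.dataset as ds

# Collect only the CSV files, so the Parquet output is never part of the dataset
csv_files = sorted(glob.glob('/home/user/csv_files/*.csv'))

# Build the dataset from the explicit file list instead of the directory path
dataset = ds.dataset(csv_files, format='csv')

# Write the Parquet output to a separate directory so later runs never discover it
ds.write_dataset(dataset, base_dir='/home/user/data/pq_files', format='parquet',
                 existing_data_behavior='overwrite_or_ignore')
{code}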

Thanks again!  :)


[jira] [Commented] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source

2022-06-10 Thread Earle Lyons (Jira)


[ https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553023#comment-17553023 ]

Earle Lyons commented on ARROW-16810:
-------------------------------------

Thanks so much for the response, [~yibocai]! 

I sincerely appreciate the response and information. Your explanation makes sense 
and seems to align with the behavior I observed. 

I suppose I could add code to remove any Parquet files before calling 
pyarrow.dataset(path, format=custom_csv_format).

However, an argument to read only specific file types would be very helpful.

For example...
 # pyarrow.dataset(path, format=custom_csv_format, filetype='csv')
 # pyarrow.dataset(path, format=custom_csv_format, fileext='.csv')
 # pyarrow.dataset(path/*.csv, format=custom_csv_format)
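
In the meantime, a rough sketch of workarounds I could use today (the 
column_types mapping below is illustrative; exclude_invalid_files is an existing 
ds.dataset option that validates each file against the given format and should 
skip the ones that fail, at the cost of extra I/O during discovery):
{code:python}
import glob

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds

csv_path = '/home/user/csv_files'
custom_csv_format = ds.CsvFileFormat(
    convert_options=csv.ConvertOptions(column_types={'value_1': pa.int64()}))

# Option 1: pass an explicit list of CSV files, so part-0.parquet is never seen
csv_only = ds.dataset(sorted(glob.glob(csv_path + '/*.csv')),
                      format=custom_csv_format)

# Option 2: keep the directory path, but let discovery validate each file
# against the CSV format and drop the ones that fail (e.g. the Parquet output)
validated = ds.dataset(csv_path, format=custom_csv_format,
                       exclude_invalid_files=True)
{code}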

Per your comment, I am CC'ing [~jorisvandenbossche].

Thanks again! :) 


[jira] [Updated] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source

2022-06-10 Thread Earle Lyons (Jira)


 [ https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Earle Lyons updated ARROW-16810:

Summary: [Python] PyArrow: write_dataset - Could not open CSV input source  
(was: PyArrow: write_dataset - Could not open CSV input source)



[jira] [Updated] (ARROW-16810) PyArrow: write_dataset - Could not open CSV input source

2022-06-10 Thread Earle Lyons (Jira)


 [ https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Earle Lyons updated ARROW-16810:

Summary: PyArrow: write_dataset - Could not open CSV input source  (was: 
[Python] PyArrow: write_dataset - Could not open CSV input source)



[jira] [Updated] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source

2022-06-10 Thread Earle Lyons (Jira)


 [ https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Earle Lyons updated ARROW-16810:

Summary: [Python] PyArrow: write_dataset - Could not open CSV input source  
(was: PyArrow: write_dataset - Could not open CSV input source)



[jira] [Created] (ARROW-16810) PyArrow: write_dataset - Could not open CSV input source

2022-06-10 Thread Earle Lyons (Jira)
Earle Lyons created ARROW-16810:
-------------------------------------

 Summary: PyArrow: write_dataset - Could not open CSV input source
 Key: ARROW-16810
 URL: https://issues.apache.org/jira/browse/ARROW-16810
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 8.0.0
 Environment: Pop!_OS 20.04 LTS; Conda 4.11.0 / Mamba 0.23.0 environment
Reporter: Earle Lyons


Hi Arrow Community! 

Happy Friday! I am a new user of Arrow, specifically pyarrow, and I am very 
excited about the project. 

I am experiencing an issue with the {*}write_dataset{*} function from the 
{*}dataset{*} module. Please forgive me if this is a known issue; however, I 
have searched the GitHub issues as well as Stack Overflow, and I have not 
identified a similar report. 

I have a directory that contains 90 CSV files (essentially one CSV for each day 
between 2021-01-01 and 2021-03-31). My objective was to read all the CSV files 
into a dataset and write the dataset out as a single Parquet file. 
Unfortunately, some of the CSV files contained nulls in some columns, which 
presented some issues that were resolved by specifying DataTypes with the 
following Stack Overflow solution:

[How do I specify a dtype for all columns when reading a CSV file with 
pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow]

The following code works on the first pass.
{code:python}
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
import re
{code}
{code:python}
pa.__version__
'8.0.0'
{code}
{code:python}
column_types = {}
csv_path = '/home/user/csv_files'
field_re_pattern = "value_*"

# Open a dataset with the 'csv_path' path and 'csv' file format
# and assign to 'dataset1'
dataset1 = ds.dataset(csv_path, format='csv')

# Loop through each field in the 'dataset1' schema,
# match the 'field_re_pattern' regex pattern in the field name,
# and assign 'int64' DataType to the field.name in the 'column_types'
# dictionary 
for field in (field for field in dataset1.schema
              if re.match(field_re_pattern, field.name)):
    column_types[field.name] = pa.int64()

# Create options for CSV data using the 'column_types' dictionary
# This returns a pyarrow.csv.ConvertOptions object
convert_options = csv.ConvertOptions(column_types=column_types)

# Create a FileFormat for CSV using the 'convert_options'
# This returns a pyarrow.dataset.CsvFileFormat object
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)

# Open a dataset with the 'csv_path' path, but instead of the plain
# 'csv' file format, use the 'custom_csv_format', and assign to
# 'dataset2'
dataset2 = ds.dataset(csv_path, format=custom_csv_format)

# Write the 'dataset2' to the 'csv_path' base directory in the 
# 'parquet' format, and overwrite/ignore if the file exists
ds.write_dataset(dataset2, base_dir=csv_path, format='parquet', 
existing_data_behavior='overwrite_or_ignore')
{code}
As previously stated, on the first pass the code works and creates a single 
Parquet file (part-0.parquet) with the correct data, row count, and schema.

However, if the code is run again, the following error is encountered:
{code:python}
ArrowInvalid: Could not open CSV input source 
'/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2: 
Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V   p   A$A18CEBS
305DEM030TTW �5HZ50GCVJV1CSV
{code}
My interpretation of the error is that, on the second pass, the 'dataset2' 
variable now includes the 'part-0.parquet' file (which can be confirmed because 
the {{dataset2.files}} output lists the file) and the CSV reader is attempting 
to parse the Parquet file.
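
For illustration, this is roughly what discovery picks up on the second run 
(continuing from the code above; the daily CSV file names are placeholders, and 
only 'part-0.parquet' is taken from the error message):
{code:python}
# Second run: the Parquet file written by the first run is discovered as well
dataset2 = ds.dataset(csv_path, format=custom_csv_format)
print(dataset2.files)
# ['/home/user/csv_files/2021-01-01.csv',
#  ...,
#  '/home/user/csv_files/part-0.parquet']   # not a CSV, so parsing fails
{code}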

If this is the case, is there an argument to ignore the Parquet file and only 
evaluate the CSV files? Also, if a dataset object has a format of 'csv' or 
'pyarrow._dataset.CsvFileFormat' associated with it, it would be nice for 
discovery to evaluate only CSV files rather than all file types in the path, if 
that is not already the current behavior.

If this is not the case, any ideas on the cause or solution?

Any assistance would be greatly appreciated.

Thank you and have a great day!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)