[jira] [Created] (ARROW-9522) [Python] read_csv() case of user specified column_names AND include_columns

Athanassios Hatzis (Jira) Sat, 18 Jul 2020 09:30:02 -0700

Athanassios Hatzis created ARROW-9522:
-----------------------------------------


             Summary: [Python] read_csv() case of user specified column_names AND include_columns
                 Key: ARROW-9522
                 URL: https://issues.apache.org/jira/browse/ARROW-9522
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.17.1
            Reporter: Athanassios Hatzis


According to the documentation of read_csv(), the following code should be valid:
{code:python}
from pyarrow.csv import read_csv, ParseOptions, ConvertOptions, ReadOptions

file_location = '/data/test/FlatFiles/SupplierPartCatalog/spc_catalog.tsv'

sep = '\t'
nulls = ['\\N']
skip_rows = 1

select = 'sid, pid, price, quantity, inspection, check'.split(', ')
as_names = 'sid, pid, price, quantity, inspection, check'.split(', ')
as_types = 'uint32, uint32, float32, uint32, timestamp[ms], bool'.split(', ')
as_columns = dict(zip(as_names, as_types))

read_options = ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False,
                           use_threads=True, column_names=as_names)
parse_options = ParseOptions(delimiter=sep)
convert_options = ConvertOptions(include_columns=select, column_types=as_columns,
                                 null_values=nulls, strings_can_be_null=True)
read_csv(file_location, read_options, parse_options, convert_options).to_pandas()
 {code}
Nevertheless, it raises an error:

{code:java}
  File "<ipython-input-2-fa6f85add265>", line 17, in <module>
    read_csv(file_location, read_options, parse_options, convert_options).to_pandas()
  File "pyarrow/_csv.pyx", line 617, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 6 columns, got 7
{code}
The error is caused by the list {color:#FF0000}as_names{color} passed as the 
{color:#FF0000}column_names{color} argument of the 
{color:#FF0000}ReadOptions(){color} constructor: it names six columns, while 
each row of the file has seven. If we set it to:


{code:python}
as_names = 'sid, pid, price, quantity, inspection, check, OMIT'.split(', ')
{code}
the code runs without errors. However, it is unreasonable to expect the user to 
supply OMIT, or any other placeholder string, for columns that are excluded 
from the read via {color:#FF0000}include_columns{color}. 
Note that the similar parameter {color:#FF0000}column_types{color} already 
behaves as expected, i.e. it takes types only for the columns that are included.
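
Until this is resolved, the OMIT workaround above could be automated. The following is only a minimal sketch using the standard library: the helper names count_fields and pad_column_names and the placeholder pattern "_unused_N" are hypothetical, not part of pyarrow, and the sketch assumes that the wanted columns are the leading columns of the file, as they are in this report.

```python
import csv


def count_fields(line, delimiter="\t"):
    """Count the fields in one raw line of a delimited file."""
    # csv.reader handles quoted fields that contain the delimiter.
    row = next(csv.reader([line.rstrip("\r\n")], delimiter=delimiter))
    return len(row)


def pad_column_names(wanted_names, n_columns, placeholder="_unused_{}"):
    """Return one name per file column: wanted names first, then placeholders.

    Assumption: the wanted columns are the first len(wanted_names) columns
    of the file. The placeholder names are purely illustrative.
    """
    if len(wanted_names) > n_columns:
        raise ValueError(
            f"{len(wanted_names)} names given, but the file has only "
            f"{n_columns} columns"
        )
    # Generate a distinct placeholder for every remaining column position.
    extra = [placeholder.format(i) for i in range(len(wanted_names), n_columns)]
    return list(wanted_names) + extra


# Example with a seven-field row, matching the shape described in this report.
sample_line = "1\t2\t3.5\t4\t2020-01-01\ttrue\textra\n"
as_names = "sid, pid, price, quantity, inspection, check".split(", ")
all_names = pad_column_names(as_names, count_fields(sample_line))
print(all_names)
```

The padded list would then be passed as column_names to ReadOptions(), while include_columns and column_types keep listing only the six wanted columns. In practice the sample line would be the first line read from the actual file.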

*Jira could not attach the file as there was a missing token. Please try 
attaching the file again. I am linking it from the previous issue*
*https://issues.apache.org/jira/secure/attachment/12991425/spc_catalog.tsv*

Related issue:
https://issues.apache.org/jira/browse/ARROW-7628



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
