Athanassios Hatzis created ARROW-9522: -----------------------------------------
Summary: [Python] read_csv() case of user specified column_names AND include_columns
Key: ARROW-9522
URL: https://issues.apache.org/jira/browse/ARROW-9522
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.17.1
Reporter: Athanassios Hatzis

The following piece of code is logically correct according to the documentation of read_csv():

{code:python}
from pyarrow.csv import read_csv, ParseOptions, ConvertOptions, ReadOptions

file_location = '/data/test/FlatFiles/SupplierPartCatalog/spc_catalog.tsv'
sep = '\t'
nulls = ['\\N']
skip_rows = 1
select = 'sid, pid, price, quantity, inspection, check'.split(', ')
as_names = 'sid, pid, price, quantity, inspection, check'.split(', ')
as_types = 'uint32, uint32, float32, uint32, timestamp[ms], bool'.split(', ')
as_columns = dict(zip(as_names, as_types))

read_options = ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=as_names)
parse_options = ParseOptions(delimiter=sep)
convert_options = ConvertOptions(include_columns=select, column_types=as_columns, null_values=nulls, strings_can_be_null=True)

read_csv(file_location, read_options, parse_options, convert_options).to_pandas()
{code}

Nevertheless, it produces an error:

{code:java}
  File "<ipython-input-2-fa6f85add265>", line 17, in <module>
    read_csv(file_location, read_options, parse_options, convert_options).to_pandas()
  File "pyarrow/_csv.pyx", line 617, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 6 columns, got 7
{code}

The error is caused by the {color:#FF0000}as_names{color} list passed as the {color:#FF0000}column_names{color} argument of {color:#FF0000}ReadOptions(){color}: it names 6 columns, while the parser finds 7 in the file.
If we set this to:

{code:python}
as_names = 'sid, pid, price, quantity, inspection, check, OMIT'.split(', ')
{code}

the code runs without errors, but it is not logical to expect the user to pass OMIT, or any other placeholder string, for columns that are left out of {color:#FF0000}include_columns{color} and therefore never read from the flat file. Notice that the similar parameter {color:#FF0000}column_types{color} is passed in the expected way, i.e. with types for only the included columns.

*Jira could not attach the file as there was a missing token. Please try attaching the file again. I am linking it from the previous issue:*
*https://issues.apache.org/jira/secure/attachment/12991425/spc_catalog.tsv*

Related issue: https://issues.apache.org/jira/browse/ARROW-7628
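For anyone hitting the same error in the meantime, the OMIT workaround above can be automated. The following is only a sketch using the standard library: `pad_column_names` and the `_unused_{}` placeholder are hypothetical names, not part of pyarrow, and the sample row is made up to have 7 fields like the file in this report.

```python
import csv
import io


def pad_column_names(sample_line, names, delimiter="\t", placeholder="_unused_{}"):
    """Return `names` extended with placeholder names so that its length
    matches the number of fields found in `sample_line`.

    Hypothetical helper for illustration; it is not a pyarrow function.
    """
    # Split one data line with the same delimiter the CSV reader will use.
    fields = next(csv.reader(io.StringIO(sample_line), delimiter=delimiter))
    missing = len(fields) - len(names)
    if missing < 0:
        raise ValueError(
            f"{len(names)} names given but the line has only {len(fields)} fields"
        )
    # One placeholder per trailing column that the caller did not name.
    return list(names) + [placeholder.format(i) for i in range(missing)]


# Made-up 7-field row standing in for a line of spc_catalog.tsv.
sample = "1\t10\t9.99\t5\t2020-01-01 00:00:00\ttrue\textra"
select = "sid, pid, price, quantity, inspection, check".split(", ")
as_names = pad_column_names(sample, select)
print(as_names)
```

The padded `as_names` would then be passed as `column_names` to `ReadOptions()`, while the unpadded `select` list stays in `include_columns`, which matches the manual OMIT workaround. It assumes the unnamed columns are the trailing ones.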