[ https://issues.apache.org/jira/browse/ARROW-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038455#comment-17038455 ]
Antoine Pitrou commented on ARROW-7628:
---------------------------------------

So I don't think there is an Arrow bug here. However, perhaps we can try to make these things easier to find out. cc [~npr] any thoughts?

> [Python] read_csv problematic cases
> -----------------------------------
>
>                 Key: ARROW-7628
>                 URL: https://issues.apache.org/jira/browse/ARROW-7628
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Ubuntu bionic
>            Reporter: Athanassios Hatzis
>            Priority: Minor
>              Labels: csv, pyarrow
>         Attachments: spc_catalog.tsv
>
>
> Hi, I have found two problematic cases, possibly bugs, in pyarrow *read_csv*
> module. I have written the following piece of code and run a test on the
> attached CSV file.
> The code compares pandas read_csv with pyarrow csv to show that the second is
> not behaving correctly with the following set of parameters:
> 1. change parameter skip_rows = 10,
> {code:python}
> Traceback (most recent call last):
>   File "/home/athan/anaconda3/envs/TRIADB/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
>     exec(code_obj, self.user_global_ns, self.user_ns)
>   File "<ipython-input-21-8c5c88b190c4>", line 4, in <module>
>     read_options=csv.ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=column_names)
>   File "pyarrow/_csv.pyx", line 541, in pyarrow._csv.read_csv
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowKeyError: Column 'catcost' in include_columns does not exist in CSV file
> {code}
> 2. change parameters skip_rows = 12, columns = None
> In this case you don't get the error above, all columns are fetched, but
> compare the two dataframes, the one from pyarrow with to_pandas() and the one
> from the output of pandas read_csv(). You will notice that the first one has
> not parsed correctly the null values ('\\N') in the last column catname. On
> the contrary pandas read_csv managed to parse all the null values correctly.
> {code:python}
> Out[28]:
>    1082  991   16.5    200  2014-09-10  1  bar
> 0  1082  997   0.55  100.0  2014-09-10  1  bar
> 1  1082  998   7.95  200.0  2014-03-03  0   \N
> 2  1083  998  12.50    NaN         NaT  0  bar
> 3  1083  999   1.00    NaN         NaT  0  foo
> 4  1084  994  57.30  100.0  2014-12-20  1   \N
> 5  1084  995  22.20    NaN         NaT  0  foo
> 6  1084  998  48.60  200.0  2014-12-20  1  foo
> {code}
> Python code to test the attached CSV file for the bugs reported above
> {code:python}
> from pyarrow import csv
> import pyarrow as pa
> import pandas as pd
> file_location = 'spc_catalog.tsv'
> sep = '\t'
> nulls=['\\N']
> columns = ['catcost', 'catqnt', 'catdate', 'catchk', 'catname']
> column_names = None
> column_types = None
> skip_rows = None
> nrecords = None
> csv.read_csv(file_location,
>     parse_options=csv.ParseOptions(delimiter=sep),
>     convert_options=csv.ConvertOptions(include_columns=columns, column_types=column_types, null_values=nulls),
>     read_options=csv.ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=column_names)
> ).to_pandas()
> pd.read_csv(file_location, sep=sep, na_values='\\N', usecols=columns,
>     nrows=nrecords, names=column_names, dtype=column_types)
> {code}