[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516868#comment-16516868 ]
Beatriz commented on ARROW-2372:
---------------------------------

I got the same issue. Trying to install from the Arrow git master with

*pip install git+[https://github.com/apache/arrow.git]*

returns an error:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\user\\AppData\\Local\\Temp\\pip-req-build-x2lu_5ci\\setup.py'

Am I missing something? Thanks!
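A guess about that error, in case it helps: in the apache/arrow repository, pyarrow's setup.py lives under python/ rather than at the repository root, which would explain pip failing to find it. Pointing pip at that subdirectory might get past this particular error (building from source would still need the Arrow C++ libraries available); a sketch:

{code:bash}
# guess at a workaround: apache/arrow keeps pyarrow's setup.py under python/,
# so install from that subdirectory instead of the repository root
pip install "git+https://github.com/apache/arrow.git#egg=pyarrow&subdirectory=python"
{code}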
> ArrowIOError: Invalid argument
> ------------------------------
>
>                 Key: ARROW-2372
>                 URL: https://issues.apache.org/jira/browse/ARROW-2372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>         Environment: Ubuntu 16.04
>            Reporter: Kyle Barron
>            Priority: Major
>             Fix For: 0.10.0
>
>
> I get an ArrowIOError when reading a specific file that was also written by pyarrow. Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ---------------------------------------------------------------------------
> ArrowIOError                              Traceback (most recent call last)
> <ipython-input-18-149f11bf68a5> in <module>()
> ----> 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
>
> ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
>      62         self.reader = ParquetReader()
>      63         source = _ensure_file(source)
> ---> 64         self.reader.open(source, metadata=metadata)
>      65         self.common_metadata = common_metadata
>      66         self._nested_paths_by_prefix = self._build_nested_paths()
>
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
>
> error.pxi in pyarrow.lib.check_status()
>
> ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Here's a reproducible example with the specific file I'm working with. I'm converting a 34 GB csv file to parquet in chunks of roughly 2 GB each. To get the source data:
> {code:bash}
> wget https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
> unzip gaz2016zcta5distancemiles.csv.zip
> {code}
> Then the basic idea, following the [pyarrow Parquet documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing], is to instantiate the writer class, loop over chunks of the csv while writing each to parquet, and then close the writer object:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pathlib import Path
>
> zcta_file = Path('gaz2016zcta5distancemiles.csv')
>
> # read the 34 GB csv in ~2 GB chunks, with explicit dtypes
> itr = pd.read_csv(
>     zcta_file,
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     engine='c',
>     chunksize=64617153)
>
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
>
> writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
>
> print('Starting conversion')
> i = 0
> for df in itr:
>     i += 1
>     print(f'Finished reading csv block {i}')
>     table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
>     writer.write_table(table)
>     print(f'Finished writing parquet block {i}')
>
> writer.close()
> {code}
> Running this script produces the file `gaz2016zcta5distancemiles.parquet`, but just attempting to read the metadata with `pq.ParquetFile()` raises the exception above.
> I tested this with pyarrow 0.8 and pyarrow 0.9.
>
> I assume that pandas would complain on import of the csv if the columns in the data were not `string`, `string`, and `float64`, so I think creating the Parquet schema in that way should be fine.
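> For what it's worth, the same writer pattern on a tiny frame with the same columns and dtypes would be expected to round-trip cleanly, which would point at the large file itself rather than the schema or the writer usage. A minimal sketch (tiny.parquet is just a scratch path):
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # tiny frame with the same three columns and dtypes as the real data
> df = pd.DataFrame({'zip1': ['00601'], 'zip2': ['00602'], 'mi_to_zcta5': [1.2]})
>
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
>
> # same write path as the full script, just a single chunk
> writer = pq.ParquetWriter('tiny.parquet', schema=schema)
> writer.write_table(pa.Table.from_pandas(df, preserve_index=False))
> writer.close()
>
> # opening the file is exactly the step that fails on the large output
> pf = pq.ParquetFile('tiny.parquet')
> print(pf.metadata.num_rows)
> {code}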