[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527479#comment-16527479 ]

Wes McKinney commented on ARROW-2372:
-------------------------------------

You can build your own wheels from git master by following the instructions in
https://github.com/apache/arrow/tree/master/python/manylinux1. I hope we will be
able to release Arrow 0.10.0 by the end of July.

> ArrowIOError: Invalid argument
> ------------------------------
>
>                 Key: ARROW-2372
>                 URL: https://issues.apache.org/jira/browse/ARROW-2372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>         Environment: Ubuntu 16.04
>            Reporter: Kyle Barron
>            Priority: Major
>             Fix For: 0.10.0
>
> I get an ArrowIOError when reading a specific file that was also written by
> pyarrow. Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ---------------------------------------------------------------------------
> ArrowIOError                              Traceback (most recent call last)
> in ()
> ----> 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
>      62         self.reader = ParquetReader()
>      63         source = _ensure_file(source)
> ---> 64         self.reader.open(source, metadata=metadata)
>      65         self.common_metadata = common_metadata
>      66         self._nested_paths_by_prefix = self._build_nested_paths()
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Here's a reproducible example with the specific file I'm working with. I'm
> converting a 34 GB csv file to parquet in chunks of roughly 2 GB each. To get
> the source data:
> {code:bash}
> wget https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
> unzip gaz2016zcta5distancemiles.csv.zip
> {code}
> Then the basic idea from the [pyarrow Parquet documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing]
> is to instantiate the writer class, loop over chunks of the csv while writing
> them to parquet, then close the writer object.
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pathlib import Path
>
> zcta_file = Path('gaz2016zcta5distancemiles.csv')
> itr = pd.read_csv(
>     zcta_file,
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     engine='c',
>     chunksize=64617153)
>
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
>
> writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
>
> print(f'Starting conversion')
> i = 0
> for df in itr:
>     i += 1
>     print(f'Finished reading csv block {i}')
>     table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
>     writer.write_table(table)
>     print(f'Finished writing parquet block {i}')
>
> writer.close()
> {code}
> Running this python script produces the file
> {code:java}
> gaz2016zcta5distancemiles.parquet
> {code}
> but just attempting to read the metadata with `pq.ParquetFile()` produces
> the above exception.
> I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would
> complain on import of the csv if the columns in the data were not `string`,
> `string`, and `float64`, so I think creating the Parquet schema in that way
> should be fine.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526444#comment-16526444 ]

Ravi commented on ARROW-2372:
-----------------------------

Hi, is there a date for when the new version of pyarrow with the above fix will be released? We are facing the same problem on 0.9.0. Thank you.
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516896#comment-16516896 ]

Wes McKinney commented on ARROW-2372:
-------------------------------------

The Arrow Python library cannot be installed that way. Refer to the Python documentation for instructions on building from source.
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516868#comment-16516868 ]

Beatriz commented on ARROW-2372:
--------------------------------

I got the same issue. Trying to build from Arrow git master, {{pip install git+https://github.com/apache/arrow.git}} returns an error:

{code:java}
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\user\\AppData\\Local\\Temp\\pip-req-build-x2lu_5ci\\setup.py'
{code}

Am I missing something? Thanks.
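[Editor's note] The FileNotFoundError above arises because the repository root has no setup.py; the Python package lives in the python/ subdirectory. pip can target a subdirectory of a VCS URL, though building pyarrow from source additionally requires the Arrow C++ libraries to be built and discoverable, so this command alone is unlikely to succeed (sketch only, per pip's VCS-install syntax):

```shell
# Point pip at the python/ subdirectory of the repo, where setup.py lives.
# NOTE: still requires a local Arrow C++ build; see the source-build docs.
pip install "git+https://github.com/apache/arrow.git#subdirectory=python"
```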
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439885#comment-16439885 ]

Kyle Barron commented on ARROW-2372:
------------------------------------

Awesome, thanks!
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439884#comment-16439884 ]

Antoine Pitrou commented on ARROW-2372:
---------------------------------------

Ok, I have downloaded the dataset and confirm that it works on git master.
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439646#comment-16439646 ]

Kyle Barron commented on ARROW-2372:
------------------------------------

Sorry, I couldn't figure out how to build Arrow and Parquet. I tried to follow [https://github.com/apache/arrow/blob/master/python/doc/source/development.rst] with conda exactly, but I get errors. Specifically, I think it's trying to use gcc 7.2.0 instead of 4.9. I might just have to wait for 0.9.1.
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439603#comment-16439603 ]

Antoine Pitrou commented on ARROW-2372:
---------------------------------------

This may have been fixed with ARROW-2369. Is there a possibility for you to test with Arrow git master?
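[Editor's note] ARROW-2369 concerns large (multi-gigabyte) files becoming unreadable, which fits the symptom here: "[Errno 22] Invalid argument" (EINVAL) is what the OS returns when a syscall gets an invalid argument, such as a seek to a negative offset. A sketch of that general mechanism, not Arrow's actual code: a 64-bit file offset squeezed through a signed 32-bit field can come out negative, and seeking to it then fails with exactly this errno.

```python
import errno
import os
import tempfile

offset_64 = int(2.5 * 1024**3)       # a file offset past 2 GiB
truncated = offset_64 & 0xFFFFFFFF   # keep only the low 32 bits
if truncated >= 2**31:               # reinterpret as signed 32-bit, as C would
    truncated -= 2**32

fd, path = tempfile.mkstemp()
try:
    try:
        # POSIX: lseek to a negative resulting offset fails with EINVAL.
        os.lseek(fd, truncated, os.SEEK_SET)
        raised = None
    except OSError as exc:
        raised = exc.errno
finally:
    os.close(fd)
    os.remove(path)

print(raised == errno.EINVAL)  # True: the same "[Errno 22]" as in the report
```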
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422572#comment-16422572 ]

Kyle Barron commented on ARROW-2372:
------------------------------------

I edited my code to the script below, which, I believe, writes a parquet file with just the first 2GB csv chunk, then with the first two, and so on, checking each time that it can open the output. Here's the traceback first, which suggests that it was able to open the Parquet file representing around 6GB of csv data, but not the Parquet file representing about 8GB of csv data.

{code:java}
Starting conversion, up to iteration 0
	0.12 minutes
Finished reading csv block 0
	0.43 minutes
Finished writing parquet block 0
	1.80 minutes
Starting conversion, up to iteration 1
	1.80 minutes
Finished reading csv block 0
	2.12 minutes
Finished writing parquet block 0
	3.49 minutes
Finished reading csv block 1
	3.80 minutes
Finished writing parquet block 1
	5.19 minutes
Starting conversion, up to iteration 2
	5.20 minutes
Finished reading csv block 0
	5.52 minutes
Finished writing parquet block 0
	6.91 minutes
Finished reading csv block 1
	7.22 minutes
Finished writing parquet block 1
	8.59 minutes
Finished reading csv block 2
	8.92 minutes
Finished writing parquet block 2
	10.29 minutes
Starting conversion, up to iteration 3
	10.29 minutes
Finished reading csv block 0
	10.60 minutes
Finished writing parquet block 0
	11.98 minutes
Finished reading csv block 1
	12.30 minutes
Finished writing parquet block 1
	13.66 minutes
Finished reading csv block 2
	13.98 minutes
Finished writing parquet block 2
	15.35 minutes
Finished reading csv block 3
	15.68 minutes
Finished writing parquet block 3
	17.05 minutes
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
in ()
     29         if j == i:
     30             writer.close()
---> 31             pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
     32             pfs_dict[i] = pf
     33             break

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
     62         self.reader = ParquetReader()
     63         source = _ensure_file(source)
---> 64         self.reader.open(source, metadata=metadata)
     65         self.common_metadata = common_metadata
     66         self._nested_paths_by_prefix = self._build_nested_paths()
_parquet.pyx in pyarrow._parquet.ParquetReader.open()
error.pxi in pyarrow.lib.check_status()
ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
{code}

And the source code:

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from time import time

t0 = time()
zcta_file = Path('gaz2016zcta5distancemiles.csv')
pfs_dict = {}
for i in range(17):
    itr = pd.read_csv(
        zcta_file,
        header=0,
        dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
        engine='c',
        chunksize=64617153)  # previously determined to be about 2GB of csv data

    msg = f'Starting conversion, up to iteration {i}'
    msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
    print(msg)

    j = 0
    for df in itr:
        msg = f'Finished reading csv block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)
        table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
        if j == 0:
            writer = pq.ParquetWriter(f'gaz2016zcta5distancemiles_{i}.parquet', schema=table.schema)
        writer.write_table(table)
        msg = f'Finished writing parquet block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)
        if j == i:
            writer.close()
            pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
            pfs_dict[i] = pf
            break
        j += 1
{code}
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420984#comment-16420984 ] Kyle Barron commented on ARROW-2372:

To make sure that the schema creation wasn't the issue, I rewrote the loop to be:

{code:python}
i = 0
for df in itr:
    i += 1
    print(f'Finished reading csv block {i}')
    table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
    if i == 1:
        writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet',
                                  schema=table.schema)
    writer.write_table(table)
    print(f'Finished writing parquet block {i}')

writer.close()
{code}

but I still get the same exception when trying to read the metadata as above.

> ArrowIOError: Invalid argument
> --
>
> Key: ARROW-2372
> URL: https://issues.apache.org/jira/browse/ARROW-2372
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0, 0.9.0
> Environment: Ubuntu 16.04
> Reporter: Kyle Barron
> Priority: Major
>
> I get an ArrowIOError when reading a specific file that was also written by pyarrow. Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ---------------------------------------------------------------------------
> ArrowIOError                              Traceback (most recent call last)
> in ()
> ----> 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
>      62         self.reader = ParquetReader()
>      63         source = _ensure_file(source)
> ---> 64         self.reader.open(source, metadata=metadata)
>      65         self.common_metadata = common_metadata
>      66         self._nested_paths_by_prefix = self._build_nested_paths()
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Here's a reproducible example with the specific file I'm working with. I'm converting a 34 GB csv file to parquet in chunks of roughly 2GB each.
> To get the source data:
> {code:bash}
> wget https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
> unzip gaz2016zcta5distancemiles.csv.zip
> {code}
> Then the basic idea from the [pyarrow Parquet documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] is to instantiate the writer class, loop over chunks of the csv while writing them to parquet, then close the writer object.
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pathlib import Path
>
> zcta_file = Path('gaz2016zcta5distancemiles.csv')
> itr = pd.read_csv(
>     zcta_file,
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     engine='c',
>     chunksize=64617153)
>
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
>
> writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
>
> print(f'Starting conversion')
> i = 0
> for df in itr:
>     i += 1
>     print(f'Finished reading csv block {i}')
>     table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
>     writer.write_table(table)
>     print(f'Finished writing parquet block {i}')
>
> writer.close()
> {code}
> Running this python script produces the file {{gaz2016zcta5distancemiles.parquet}}, but just attempting to read the metadata with `pq.ParquetFile()` produces the above exception.
> I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would complain on import of the csv if the columns in the data were not `string`, `string`, and `float64`, so I think creating the Parquet schema in that way should be fine.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)