[jira] [Created] (ARROW-17608) [JS] Implement C Data Interface
Kyle Barron created ARROW-17608:
-----------------------------------

             Summary: [JS] Implement C Data Interface
                 Key: ARROW-17608
                 URL: https://issues.apache.org/jira/browse/ARROW-17608
             Project: Apache Arrow
          Issue Type: New Feature
          Components: JavaScript
            Reporter: Kyle Barron


I've recently been working on an implementation of the C Data Interface for Arrow JS, the idea being that Arrow JS can read memory from WebAssembly this way without a copy ([blog post|https://observablehq.com/@kylebarron/zero-copy-apache-arrow-with-webassembly], [repo|https://github.com/kylebarron/arrow-js-ffi/pull/11]). Dominik [suggested|https://twitter.com/domoritz/status/1562670919469842432?s=20&t=Ts8HQe_fzgRmecUP1Qrhrw] starting a discussion about potentially adding this to Arrow JS.

My implementation is still a WIP, but I figure it's not too early to start a discussion. A couple of notes:

- I'm focused only on reading FFI memory, so I only have parsing code. I figure writing doesn't really make sense in JS, since Wasm can't access arbitrary JS memory.
- In order to generate FFI memory in the tests, I'm using a small Rust module to convert from an IPC table. If we didn't want to add a Rust build step in the tests, that module could be published to NPM.

Thoughts?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
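For readers unfamiliar with the C Data Interface, here is a minimal sketch of the round trip it enables, illustrated with pyarrow's `pyarrow.cffi` helpers rather than the Arrow JS code under discussion; an Arrow JS FFI reader would implement only the consumer half, pointed at offsets inside a `WebAssembly.Memory` buffer.

```
# A minimal sketch of the C Data Interface round trip, shown with pyarrow
# (not the Arrow JS implementation under discussion). The producer writes
# ArrowArray/ArrowSchema structs at raw memory addresses; the consumer
# reads them back without copying the underlying buffers.
import pyarrow as pa
from pyarrow.cffi import ffi  # cffi definitions for the C Data Interface structs

# Allocate the two C structs the interface is built around.
c_schema = ffi.new("struct ArrowSchema*")
c_array = ffi.new("struct ArrowArray*")
schema_ptr = int(ffi.cast("uintptr_t", c_schema))
array_ptr = int(ffi.cast("uintptr_t", c_array))

arr = pa.array([1, 2, 3], type=pa.int32())
arr._export_to_c(array_ptr, schema_ptr)  # producer half

# Consumer half: this is the role an Arrow JS FFI reader would play,
# parsing the structs out of Wasm memory instead of native memory.
roundtripped = pa.Array._import_from_c(array_ptr, schema_ptr)
assert roundtripped.equals(arr)
```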
[jira] [Created] (ARROW-16613) [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
Kyle Barron created ARROW-16613:
-----------------------------------

             Summary: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
                 Key: ARROW-16613
                 URL: https://issues.apache.org/jira/browse/ARROW-16613
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Parquet, Python
    Affects Versions: 8.0.0
            Reporter: Kyle Barron


Hello! I've noticed that writing a `_metadata` file with `pyarrow.parquet.write_metadata` is very slow with a large `metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that the concatenation inside `metadata.append_row_groups` is very slow. The writer first [iterates over every item of the list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302] and then [concatenates them on each iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

Would it be possible to make a vectorized implementation of this, where `append_row_groups` accepts a list of `FileMetaData` objects and concatenation happens only once?

Repro (in IPython to use `%time`):

```
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]


schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s
```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
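Until something like that lands, one hypothetical user-side mitigation is to merge the collected metadata as a balanced tree rather than a left-to-right chain, so each row group is copied O(log n) rather than O(n) times. A sketch, assuming the collector holds distinct `FileMetaData` objects (`append_row_groups` mutates its receiver, so it would misbehave on the `[meta] * 500` repro above, which aliases a single object):

```
# Hypothetical workaround sketch, not part of pyarrow: pairwise tree
# reduction over FileMetaData objects. A left-to-right chain re-copies
# the growing result on every append (O(n^2) row-group copies overall);
# merging in rounds of pairs copies each row group only O(log n) times.
def tree_merge_metadata(metas):
    metas = list(metas)
    while len(metas) > 1:
        merged = []
        for i in range(0, len(metas) - 1, 2):
            left, right = metas[i], metas[i + 1]
            left.append_row_groups(right)  # in-place concatenation
            merged.append(left)
        if len(metas) % 2 == 1:
            merged.append(metas[-1])  # odd element carries to the next round
        metas = merged
    return metas[0]
```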
[jira] [Created] (ARROW-16287) PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file
Kyle Barron created ARROW-16287:
-----------------------------------

             Summary: PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file
                 Key: ARROW-16287
                 URL: https://issues.apache.org/jira/browse/ARROW-16287
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet
    Affects Versions: 7.0.0
         Environment: MacOS. Python 3.8.10.
pyarrow: '7.0.0'
pandas: '1.4.2'
numpy: '1.22.3'
            Reporter: Kyle Barron


I'm trying to follow the example here: [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files] to write an example partitioned dataset, but I'm consistently getting an error about non-equal schemas. Here's an MCVE:

```
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)

table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)

metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")

# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
)
```

This raises the error:

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [92], in <cell line: 1>()
----> 1 pq.write_metadata(
      2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
      3 )

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
   2322 metadata = read_metadata(where)
   2323 for m in metadata_collector:
-> 2324     metadata.append_row_groups(m)
   2325 metadata.write_metadata_file(where)

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.
```

But all schemas in the `metadata_collector` list seem to be the same:

```
all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
# True
```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
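A hypothetical diagnostic for situations like this: `ParquetSchema` equality in Python may be coarser than the C++ `AppendRowGroups` check, so converting each entry to an Arrow schema and comparing with metadata included can surface which entry actually differs:

```
# Hypothetical diagnostic sketch: compare the converted Arrow schemas,
# including field/schema-level metadata (e.g. pandas metadata), which the
# `==` check on ParquetSchema above may not fully cover.
base = metadata_collector[0].schema.to_arrow_schema()
for i, meta in enumerate(metadata_collector):
    other = meta.schema.to_arrow_schema()
    if not base.equals(other, check_metadata=True):
        print(f"collector entry {i} differs from entry 0")
        print(base)
        print(other)
```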
[jira] [Closed] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kyle Barron closed ARROW-2372.
------------------------------
    Resolution: Fixed

> ArrowIOError: Invalid argument
> ------------------------------
>
>                 Key: ARROW-2372
>                 URL: https://issues.apache.org/jira/browse/ARROW-2372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>        Environment: Ubuntu 16.04
>            Reporter: Kyle Barron
>            Priority: Major
>             Fix For: 0.9.1
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439885#comment-16439885 ]

Kyle Barron commented on ARROW-2372:
------------------------------------

Awesome, thanks!

> ArrowIOError: Invalid argument
> ------------------------------
>
>                 Key: ARROW-2372
>                 URL: https://issues.apache.org/jira/browse/ARROW-2372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>        Environment: Ubuntu 16.04
>            Reporter: Kyle Barron
>            Priority: Major
>             Fix For: 0.9.1
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439646#comment-16439646 ]

Kyle Barron commented on ARROW-2372:
------------------------------------

Sorry, I couldn't figure out how to build Arrow and Parquet. I tried to follow [https://github.com/apache/arrow/blob/master/python/doc/source/development.rst] with Conda exactly, but I get errors. Specifically, I think it's trying to use gcc 7.2.0 instead of 4.9. I might just have to wait for 0.9.1.

> ArrowIOError: Invalid argument
> ------------------------------
>
>                 Key: ARROW-2372
>                 URL: https://issues.apache.org/jira/browse/ARROW-2372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>        Environment: Ubuntu 16.04
>            Reporter: Kyle Barron
>            Priority: Major
>             Fix For: 0.9.1
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422572#comment-16422572 ]

Kyle Barron commented on ARROW-2372:
------------------------------------

I edited my code to the script below, which, I believe, writes a parquet file with just the first 2GB csv chunk, then with the first two chunks, and so on, checking each time that it can open the output. Here's the traceback first, which suggests that it was able to open the Parquet file representing around 6GB of csv data, but not the Parquet file representing about 8GB of csv data.

{code:java}
Starting conversion, up to iteration 0
	0.12 minutes
Finished reading csv block 0
	0.43 minutes
Finished writing parquet block 0
	1.80 minutes
Starting conversion, up to iteration 1
	1.80 minutes
Finished reading csv block 0
	2.12 minutes
Finished writing parquet block 0
	3.49 minutes
Finished reading csv block 1
	3.80 minutes
Finished writing parquet block 1
	5.19 minutes
Starting conversion, up to iteration 2
	5.20 minutes
Finished reading csv block 0
	5.52 minutes
Finished writing parquet block 0
	6.91 minutes
Finished reading csv block 1
	7.22 minutes
Finished writing parquet block 1
	8.59 minutes
Finished reading csv block 2
	8.92 minutes
Finished writing parquet block 2
	10.29 minutes
Starting conversion, up to iteration 3
	10.29 minutes
Finished reading csv block 0
	10.60 minutes
Finished writing parquet block 0
	11.98 minutes
Finished reading csv block 1
	12.30 minutes
Finished writing parquet block 1
	13.66 minutes
Finished reading csv block 2
	13.98 minutes
Finished writing parquet block 2
	15.35 minutes
Finished reading csv block 3
	15.68 minutes
Finished writing parquet block 3
	17.05 minutes

---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input> in <module>()
     29         if j == i:
     30             writer.close()
---> 31             pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
     32             pfs_dict[i] = pf
     33             break

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
     62         self.reader = ParquetReader()
     63         source = _ensure_file(source)
---> 64         self.reader.open(source, metadata=metadata)
     65         self.common_metadata = common_metadata
     66         self._nested_paths_by_prefix = self._build_nested_paths()

_parquet.pyx in pyarrow._parquet.ParquetReader.open()

error.pxi in pyarrow.lib.check_status()

ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
{code}

And the source code:

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from time import time

t0 = time()
zcta_file = Path('gaz2016zcta5distancemiles.csv')

pfs_dict = {}
for i in range(17):
    itr = pd.read_csv(
        zcta_file,
        header=0,
        dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
        engine='c',
        chunksize=64617153)  # previously determined to be about 2GB of csv data

    msg = f'Starting conversion, up to iteration {i}'
    msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
    print(msg)

    j = 0
    for df in itr:
        msg = f'Finished reading csv block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
        if j == 0:
            writer = pq.ParquetWriter(
                f'gaz2016zcta5distancemiles_{i}.parquet', schema=table.schema)

        writer.write_table(table)
        msg = f'Finished writing parquet block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        if j == i:
            writer.close()
            pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
            pfs_dict[i] = pf
            break

        j += 1
{code}

> ArrowIOError: Invalid argument
> ------------------------------
>
>                 Key: ARROW-2372
>                 URL: https://issues.apache.org/jira/browse/ARROW-2372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>        Environment: Ubuntu 16.04
>            Reporter: Kyle Barron
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420984#comment-16420984 ]

Kyle Barron commented on ARROW-2372:
------------------------------------

To make sure that the schema creation wasn't the issue, I rewrote the loop to be:

{code:python}
i = 0
for df in itr:
    i += 1
    print(f'Finished reading csv block {i}')
    table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
    if i == 1:
        writer = pq.ParquetWriter(
            'gaz2016zcta5distancemiles.parquet', schema=table.schema)
    writer.write_table(table)
    print(f'Finished writing parquet block {i}')

writer.close()
{code}

but I still get the same exception when trying to read the metadata as above.

> ArrowIOError: Invalid argument
> ------------------------------
>
>                 Key: ARROW-2372
>                 URL: https://issues.apache.org/jira/browse/ARROW-2372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>        Environment: Ubuntu 16.04
>            Reporter: Kyle Barron
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (ARROW-2372) ArrowIOError: Invalid argument
Kyle Barron created ARROW-2372:
----------------------------------

             Summary: ArrowIOError: Invalid argument
                 Key: ARROW-2372
                 URL: https://issues.apache.org/jira/browse/ARROW-2372
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.9.0, 0.8.0
         Environment: Ubuntu 16.04
            Reporter: Kyle Barron


I get an ArrowIOError when reading a specific file that was also written by pyarrow. Specifically, the traceback is:

{code:python}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
     62         self.reader = ParquetReader()
     63         source = _ensure_file(source)
---> 64         self.reader.open(source, metadata=metadata)
     65         self.common_metadata = common_metadata
     66         self._nested_paths_by_prefix = self._build_nested_paths()

_parquet.pyx in pyarrow._parquet.ParquetReader.open()

error.pxi in pyarrow.lib.check_status()

ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
{code}

Here's a reproducible example with the specific file I'm working with. I'm converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get the source data:

{code:bash}
wget https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
unzip gaz2016zcta5distancemiles.csv.zip
{code}

Then the basic idea from the [pyarrow Parquet documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] is instantiating the writer class; looping over chunks of the csv and writing them to parquet; then closing the writer object.

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

zcta_file = Path('gaz2016zcta5distancemiles.csv')
itr = pd.read_csv(
    zcta_file,
    header=0,
    dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
    engine='c',
    chunksize=64617153)

schema = pa.schema([
    pa.field('zip1', pa.string()),
    pa.field('zip2', pa.string()),
    pa.field('mi_to_zcta5', pa.float64())])

writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)

print(f'Starting conversion')
i = 0
for df in itr:
    i += 1
    print(f'Finished reading csv block {i}')
    table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
    writer.write_table(table)
    print(f'Finished writing parquet block {i}')

writer.close()
{code}

Running this python script produces the file `gaz2016zcta5distancemiles.parquet`, but just attempting to read the metadata with `pq.ParquetFile()` produces the above exception.

I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would complain on import of the csv if the columns in the data were not `string`, `string`, and `float64`, so I think creating the Parquet schema in that way should be fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)