[jira] [Created] (ARROW-17608) [JS] Implement C Data Interface

2022-09-04 Thread Kyle Barron (Jira)
Kyle Barron created ARROW-17608:
---

 Summary: [JS] Implement C Data Interface
 Key: ARROW-17608
 URL: https://issues.apache.org/jira/browse/ARROW-17608
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Reporter: Kyle Barron


I've recently been working on an implementation of the C Data Interface for 
Arrow JS, the idea being that Arrow JS can read memory from WebAssembly this 
way without a copy ([blog 
post|https://observablehq.com/@kylebarron/zero-copy-apache-arrow-with-webassembly],
 [repo|https://github.com/kylebarron/arrow-js-ffi/pull/11]). Dominik 
[suggested|https://twitter.com/domoritz/status/1562670919469842432?s=20=Ts8HQe_fzgRmecUP1Qrhrw]
 starting a discussion about potentially adding this into Arrow JS.

My implementation is still a WIP, but I figure it's not too early to start a 
discussion. A couple of notes:

- I'm focused only on reading FFI memory, so I only have parsing code (see the 
sketch after these notes). I figure writing doesn't really make sense in JS, 
since Wasm can't access arbitrary JS memory.
- In order to generate FFI memory in the tests, I'm using a small Rust module 
to convert from an IPC table. If we didn't want to add a rust build step in the 
tests, that module could be published to NPM
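
For concreteness, here's a minimal sketch of the C Data Interface round trip using pyarrow's `pyarrow.cffi` helpers (Python, to match the other examples in this archive; it only illustrates the ArrowSchema/ArrowArray structs that the JS parser would read out of Wasm memory, not the proposed JS API):

```python
# Minimal sketch of the C Data Interface round trip in pyarrow. The JS
# implementation parses the same ArrowSchema/ArrowArray structs, just laid
# out in WebAssembly linear memory instead of native memory.
import pyarrow as pa
from pyarrow.cffi import ffi

arr = pa.array([1, 2, 3], type=pa.int32())

# Allocate the two C structs and take their addresses as plain integers.
c_schema = ffi.new("struct ArrowSchema*")
c_array = ffi.new("struct ArrowArray*")
schema_ptr = int(ffi.cast("uintptr_t", c_schema))
array_ptr = int(ffi.cast("uintptr_t", c_array))

# Export: fills the structs with pointers into arr's buffers (no copy).
arr._export_to_c(array_ptr, schema_ptr)

# Import: rebuilds an Array from the structs, again without copying buffers.
arr2 = pa.Array._import_from_c(array_ptr, schema_ptr)
assert arr.equals(arr2)
```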

Thoughts?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16613) [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

2022-05-18 Thread Kyle Barron (Jira)
Kyle Barron created ARROW-16613:
---

 Summary: [Python][Parquet] pyarrow.parquet.write_metadata with 
metadata_collector appears to be O(n^2)
 Key: ARROW-16613
 URL: https://issues.apache.org/jira/browse/ARROW-16613
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Parquet, Python
Affects Versions: 8.0.0
Reporter: Kyle Barron


Hello!

 

I've noticed that when writing a `_metadata` file with 
`pyarrow.parquet.write_metadata`, it is very slow with a large 
`metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that 
the concatenation inside `metadata.append_row_groups` is very slow. The writer first 
[iterates over every item of the 
list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302]
 and then [concatenates them onto the accumulated metadata on each 
iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

 

Would it be possible to make a vectorized implementation of this, where 
`append_row_groups` accepts a list of `FileMetaData` objects and the 
concatenation happens only once?
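
To illustrate why the batched form matters, here's a plain-Python sketch of the two patterns (not pyarrow code; the names are only illustrative): re-concatenating onto an accumulator copies the accumulated prefix on every step, which is quadratic overall, while a single batched pass copies each item once.

```python
# Plain-Python illustration of the asymptotics; not the pyarrow API.
def append_one_at_a_time(chunks):
    # Mirrors the current pattern: each step re-copies the whole accumulator,
    # so total work is 1 + 2 + ... + n, i.e. O(n^2).
    acc = []
    for chunk in chunks:
        acc = acc + chunk
    return acc


def append_all_at_once(chunks):
    # Mirrors the requested pattern: every item is copied exactly once, O(n).
    acc = []
    for chunk in chunks:
        acc.extend(chunk)
    return acc
```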

 

Repro (in IPython to use `%time`)

```
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]


schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s

```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16287) PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file

2022-04-22 Thread Kyle Barron (Jira)
Kyle Barron created ARROW-16287:
---

 Summary: PyArrow: RuntimeError: AppendRowGroups requires equal 
schemas when writing _metadata file
 Key: ARROW-16287
 URL: https://issues.apache.org/jira/browse/ARROW-16287
 Project: Apache Arrow
  Issue Type: Bug
  Components: Parquet
Affects Versions: 7.0.0
 Environment: MacOS. Python 3.8.10.
pyarrow: '7.0.0'
pandas: '1.4.2'
numpy: '1.22.3'
Reporter: Kyle Barron


I'm trying to follow the example here: 
[https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files]
 to write an example partitioned dataset, but I'm consistently getting an error 
about non-equal schemas. Here's a minimal reproducible example:

```

from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)

metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")

# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
)

```

This raises the error

```

---
RuntimeError                              Traceback (most recent call last)
Input In [92], in ()
> 1 pq.write_metadata(
      2     table.schema, root_path / "_metadata", 
metadata_collector=metadata_collector
      3 )

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in 
write_metadata(schema, where, metadata_collector, **kwargs)
   2322 metadata = read_metadata(where)
   2323 for m in metadata_collector:
-> 2324     metadata.append_row_groups(m)
   2325 metadata.write_metadata_file(where)

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in 
pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.

```

But all schemas in the `metadata_collector` list seem to be the same:

```

all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)

# True

```
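
For what it's worth, the check above compares the collector entries with each other, while `write_metadata` compares each entry against the metadata produced from `table.schema`. A sketch of that comparison (continuing the repro above; one plausible source of the mismatch is the partition column, which `write_to_dataset` encodes in the directory names rather than in the per-file schemas):

```
# Sketch: compare a collected per-file schema against the schema that
# write_metadata serializes from table.schema, which is the pair that
# append_row_groups actually checks.
from io import BytesIO

buf = BytesIO()
pq.write_metadata(table.schema, buf)  # schema-only metadata, as write_metadata writes first
base = pq.read_metadata(buf)

print(base.schema.to_arrow_schema())
print(metadata_collector[0].schema.to_arrow_schema())
print(base.schema == metadata_collector[0].schema)
```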



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (ARROW-2372) ArrowIOError: Invalid argument

2018-04-16 Thread Kyle Barron (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Barron closed ARROW-2372.
--
Resolution: Fixed

> ArrowIOError: Invalid argument
> --
>
> Key: ARROW-2372
> URL: https://issues.apache.org/jira/browse/ARROW-2372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
> Environment: Ubuntu 16.04
>Reporter: Kyle Barron
>Priority: Major
> Fix For: 0.9.1
>
>
> I get an ArrowIOError when reading a specific file that was also written by 
> pyarrow. Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
>  ---
>  ArrowIOError Traceback (most recent call last)
>   in ()
>  > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in 
> __init__(self, source, metadata, common_metadata)
>  62 self.reader = ParquetReader()
>  63 source = _ensure_file(source)
>  ---> 64 self.reader.open(source, metadata=metadata)
>  65 self.common_metadata = common_metadata
>  66 self._nested_paths_by_prefix = self._build_nested_paths()
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Here's a reproducible example with the specific file I'm working with. I'm 
> converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get 
> the source data:
> {code:bash}
> wget 
> https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
> unzip gaz2016zcta5distancemiles.csv.zip{code}
> Then the basic idea from the [pyarrow Parquet 
> documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing]
>  is instantiating the writer class; looping over chunks of the csv and 
> writing them to parquet; then closing the writer object.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pathlib import Path
> zcta_file = Path('gaz2016zcta5distancemiles.csv')
> itr = pd.read_csv(
>     zcta_file,
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     engine='c',
>     chunksize=64617153)
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
> writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
> print(f'Starting conversion')
> i = 0
> for df in itr:
>     i += 1
>     print(f'Finished reading csv block {i}')
>     table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
>     writer.write_table(table)
>     print(f'Finished writing parquet block {i}')
> writer.close()
> {code}
> Then running this python script produces the file 
> {code:java}
> gaz2016zcta5distancemiles.parquet{code}
> , but just attempting to read the metadata with `pq.ParquetFile()` produces 
> the above exception.
> I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would 
> complain on import of the csv if the columns in the data were not `string`, 
> `string`, and `float64`, so I think creating the Parquet schema in that way 
> should be fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-04-16 Thread Kyle Barron (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439885#comment-16439885
 ] 

Kyle Barron commented on ARROW-2372:


Awesome, thanks!

> ArrowIOError: Invalid argument
> --
>
> Key: ARROW-2372
> URL: https://issues.apache.org/jira/browse/ARROW-2372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
> Environment: Ubuntu 16.04
>Reporter: Kyle Barron
>Priority: Major
> Fix For: 0.9.1
>
>
> I get an ArrowIOError when reading a specific file that was also written by 
> pyarrow. Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
>  ---
>  ArrowIOError Traceback (most recent call last)
>   in ()
>  > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in 
> __init__(self, source, metadata, common_metadata)
>  62 self.reader = ParquetReader()
>  63 source = _ensure_file(source)
>  ---> 64 self.reader.open(source, metadata=metadata)
>  65 self.common_metadata = common_metadata
>  66 self._nested_paths_by_prefix = self._build_nested_paths()
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Here's a reproducible example with the specific file I'm working with. I'm 
> converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get 
> the source data:
> {code:bash}
> wget 
> https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
> unzip gaz2016zcta5distancemiles.csv.zip{code}
> Then the basic idea from the [pyarrow Parquet 
> documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing]
>  is instantiating the writer class; looping over chunks of the csv and 
> writing them to parquet; then closing the writer object.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pathlib import Path
> zcta_file = Path('gaz2016zcta5distancemiles.csv')
> itr = pd.read_csv(
>     zcta_file,
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     engine='c',
>     chunksize=64617153)
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
> writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
> print(f'Starting conversion')
> i = 0
> for df in itr:
>     i += 1
>     print(f'Finished reading csv block {i}')
>     table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
>     writer.write_table(table)
>     print(f'Finished writing parquet block {i}')
> writer.close()
> {code}
> Then running this python script produces the file 
> {code:java}
> gaz2016zcta5distancemiles.parquet{code}
> , but just attempting to read the metadata with `pq.ParquetFile()` produces 
> the above exception.
> I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would 
> complain on import of the csv if the columns in the data were not `string`, 
> `string`, and `float64`, so I think creating the Parquet schema in that way 
> should be fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-04-16 Thread Kyle Barron (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439646#comment-16439646
 ] 

Kyle Barron commented on ARROW-2372:


Sorry, I couldn't figure out how to build Arrow and Parquet. I tried to follow 
[https://github.com/apache/arrow/blob/master/python/doc/source/development.rst] 
with Conda exactly, but I get errors. Specifically, I think it's trying to use 
gcc 7.2.0 instead of 4.9. I might just have to wait for 0.9.1.

> ArrowIOError: Invalid argument
> --
>
> Key: ARROW-2372
> URL: https://issues.apache.org/jira/browse/ARROW-2372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
> Environment: Ubuntu 16.04
>Reporter: Kyle Barron
>Priority: Major
> Fix For: 0.9.1
>
>
> I get an ArrowIOError when reading a specific file that was also written by 
> pyarrow. Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
>  ---
>  ArrowIOError Traceback (most recent call last)
>   in ()
>  > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in 
> __init__(self, source, metadata, common_metadata)
>  62 self.reader = ParquetReader()
>  63 source = _ensure_file(source)
>  ---> 64 self.reader.open(source, metadata=metadata)
>  65 self.common_metadata = common_metadata
>  66 self._nested_paths_by_prefix = self._build_nested_paths()
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Here's a reproducible example with the specific file I'm working with. I'm 
> converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get 
> the source data:
> {code:bash}
> wget 
> https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
> unzip gaz2016zcta5distancemiles.csv.zip{code}
> Then the basic idea from the [pyarrow Parquet 
> documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing]
>  is instantiating the writer class; looping over chunks of the csv and 
> writing them to parquet; then closing the writer object.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pathlib import Path
> zcta_file = Path('gaz2016zcta5distancemiles.csv')
> itr = pd.read_csv(
>     zcta_file,
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     engine='c',
>     chunksize=64617153)
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
> writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
> print(f'Starting conversion')
> i = 0
> for df in itr:
>     i += 1
>     print(f'Finished reading csv block {i}')
>     table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
>     writer.write_table(table)
>     print(f'Finished writing parquet block {i}')
> writer.close()
> {code}
> Then running this python script produces the file 
> {code:java}
> gaz2016zcta5distancemiles.parquet{code}
> , but just attempting to read the metadata with `pq.ParquetFile()` produces 
> the above exception.
> I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would 
> complain on import of the csv if the columns in the data were not `string`, 
> `string`, and `float64`, so I think creating the Parquet schema in that way 
> should be fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-04-02 Thread Kyle Barron (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422572#comment-16422572
 ] 

Kyle Barron commented on ARROW-2372:


I edited my code to the script below, which, I believe, writes a parquet file 
with just the first 2GB csv chunk, then with the first two, and so on, checking 
each time that it can open the output. Here's the traceback first, which 
suggests that it was able to open the Parquet file representing around 6GB of 
csv data, but not the Parquet file representing about 8GB of csv data.
{code:java}
Starting conversion, up to iteration 0
0.12 minutes
Finished reading csv block 0
0.43 minutes
Finished writing parquet block 0
1.80 minutes
Starting conversion, up to iteration 1
1.80 minutes
Finished reading csv block 0
2.12 minutes
Finished writing parquet block 0
3.49 minutes
Finished reading csv block 1
3.80 minutes
Finished writing parquet block 1
5.19 minutes
Starting conversion, up to iteration 2
5.20 minutes
Finished reading csv block 0
5.52 minutes
Finished writing parquet block 0
6.91 minutes
Finished reading csv block 1
7.22 minutes
Finished writing parquet block 1
8.59 minutes
Finished reading csv block 2
8.92 minutes
Finished writing parquet block 2
10.29 minutes
Starting conversion, up to iteration 3
10.29 minutes
Finished reading csv block 0
10.60 minutes
Finished writing parquet block 0
11.98 minutes
Finished reading csv block 1
12.30 minutes
Finished writing parquet block 1
13.66 minutes
Finished reading csv block 2
13.98 minutes
Finished writing parquet block 2
15.35 minutes
Finished reading csv block 3
15.68 minutes
Finished writing parquet block 3
17.05 minutes
---
ArrowIOError  Traceback (most recent call last)
 in ()
 29 if j == i:
 30 writer.close()
---> 31 pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
 32 pfs_dict[i] = pf
 33 break

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in 
__init__(self, source, metadata, common_metadata)
 62 self.reader = ParquetReader()
 63 source = _ensure_file(source)
---> 64 self.reader.open(source, metadata=metadata)
 65 self.common_metadata = common_metadata
 66 self._nested_paths_by_prefix = self._build_nested_paths()

_parquet.pyx in pyarrow._parquet.ParquetReader.open()

error.pxi in pyarrow.lib.check_status()

ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
{code}
And the source code:
{code}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from time import time

t0 = time()

zcta_file = Path('gaz2016zcta5distancemiles.csv')

pfs_dict = {}

for i in range(17):
    itr = pd.read_csv(
        zcta_file,
        header=0,
        dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
        engine='c',
        chunksize=64617153)  # previously determined to be about 2GB of csv data

    msg = f'Starting conversion, up to iteration {i}'
    msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
    print(msg)

    j = 0
    for df in itr:
        msg = f'Finished reading csv block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
        if j == 0:
            writer = pq.ParquetWriter(f'gaz2016zcta5distancemiles_{i}.parquet',
                                      schema=table.schema)

        writer.write_table(table)

        msg = f'Finished writing parquet block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        if j == i:
            writer.close()
            pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
            pfs_dict[i] = pf
            break

        j += 1
{code}

> ArrowIOError: Invalid argument
> --
>
> Key: ARROW-2372
> URL: https://issues.apache.org/jira/browse/ARROW-2372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
> Environment: Ubuntu 16.04
>Reporter: Kyle Barron
>Priority: Major
>
> I get an ArrowIOError when reading a specific file that was also written by 
> pyarrow. Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
>  ---
>  ArrowIOError Traceback (most recent call last)
>   in ()
>  > 1 pf = 

[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-03-30 Thread Kyle Barron (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420984#comment-16420984
 ] 

Kyle Barron commented on ARROW-2372:


To make sure that the schema creation wasn't the issue, I rewrote the loop to 
be:

{code:python}
i = 0
for df in itr:
    i += 1
    print(f'Finished reading csv block {i}')

    table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
    if i == 1:
        writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet',
                                  schema=table.schema)

    writer.write_table(table)

    print(f'Finished writing parquet block {i}')

writer.close()

{code}

but I still get the same exception when trying to read the metadata as above.

> ArrowIOError: Invalid argument
> --
>
> Key: ARROW-2372
> URL: https://issues.apache.org/jira/browse/ARROW-2372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
> Environment: Ubuntu 16.04
>Reporter: Kyle Barron
>Priority: Major
>
> I get an ArrowIOError when reading a specific file that was also written by 
> pyarrow. Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
>  ---
>  ArrowIOError Traceback (most recent call last)
>   in ()
>  > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in 
> __init__(self, source, metadata, common_metadata)
>  62 self.reader = ParquetReader()
>  63 source = _ensure_file(source)
>  ---> 64 self.reader.open(source, metadata=metadata)
>  65 self.common_metadata = common_metadata
>  66 self._nested_paths_by_prefix = self._build_nested_paths()
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Here's a reproducible example with the specific file I'm working with. I'm 
> converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get 
> the source data:
> {code:bash}
> wget 
> https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
> unzip gaz2016zcta5distancemiles.csv.zip{code}
> Then the basic idea from the [pyarrow Parquet 
> documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing]
>  is instantiating the writer class; looping over chunks of the csv and 
> writing them to parquet; then closing the writer object.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pathlib import Path
> zcta_file = Path('gaz2016zcta5distancemiles.csv')
> itr = pd.read_csv(
>     zcta_file,
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     engine='c',
>     chunksize=64617153)
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
> writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
> print(f'Starting conversion')
> i = 0
> for df in itr:
>     i += 1
>     print(f'Finished reading csv block {i}')
>     table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
>     writer.write_table(table)
>     print(f'Finished writing parquet block {i}')
> writer.close()
> {code}
> Then running this python script produces the file 
> {code:java}
> gaz2016zcta5distancemiles.parquet{code}
> , but just attempting to read the metadata with `pq.ParquetFile()` produces 
> the above exception.
> I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would 
> complain on import of the csv if the columns in the data were not `string`, 
> `string`, and `float64`, so I think creating the Parquet schema in that way 
> should be fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2372) ArrowIOError: Invalid argument

2018-03-30 Thread Kyle Barron (JIRA)
Kyle Barron created ARROW-2372:
--

 Summary: ArrowIOError: Invalid argument
 Key: ARROW-2372
 URL: https://issues.apache.org/jira/browse/ARROW-2372
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0, 0.8.0
 Environment: Ubuntu 16.04
Reporter: Kyle Barron


I get an ArrowIOError when reading a specific file that was also written by 
pyarrow. Specifically, the traceback is:
{code:python}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
 ---
 ArrowIOError Traceback (most recent call last)
  in ()
 > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in 
__init__(self, source, metadata, common_metadata)
 62 self.reader = ParquetReader()
 63 source = _ensure_file(source)
 ---> 64 self.reader.open(source, metadata=metadata)
 65 self.common_metadata = common_metadata
 66 self._nested_paths_by_prefix = self._build_nested_paths()
_parquet.pyx in pyarrow._parquet.ParquetReader.open()
error.pxi in pyarrow.lib.check_status()
ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
{code}
Here's a reproducible example with the specific file I'm working with. I'm 
converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get 
the source data:
{code:bash}
wget 
https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
unzip gaz2016zcta5distancemiles.csv.zip{code}
Then the basic idea from the [pyarrow Parquet 
documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing]
 is instantiating the writer class; looping over chunks of the csv and writing 
them to parquet; then closing the writer object.

 
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

zcta_file = Path('gaz2016zcta5distancemiles.csv')
itr = pd.read_csv(
    zcta_file,
    header=0,
    dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
    engine='c',
    chunksize=64617153)

schema = pa.schema([
    pa.field('zip1', pa.string()),
    pa.field('zip2', pa.string()),
    pa.field('mi_to_zcta5', pa.float64())])

writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
print(f'Starting conversion')

i = 0
for df in itr:
    i += 1
    print(f'Finished reading csv block {i}')

    table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
    writer.write_table(table)

    print(f'Finished writing parquet block {i}')

writer.close()
{code}
Then running this python script produces the file 
{code:java}
gaz2016zcta5distancemiles.parquet{code}
, but just attempting to read the metadata with `pq.ParquetFile()` produces the 
above exception.

I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would 
complain on import of the csv if the columns in the data were not `string`, 
`string`, and `float64`, so I think creating the Parquet schema in that way 
should be fine.
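
For what it's worth, a quick way to double-check that assumption would be something like the sketch below (chunksize reduced so the check runs faster), comparing each chunk's inferred Arrow schema against the declared one:
{code:python}
# Sketch: verify that every csv chunk infers the same Arrow schema as the
# one declared above.
import numpy as np
import pandas as pd
import pyarrow as pa

schema = pa.schema([
    pa.field('zip1', pa.string()),
    pa.field('zip2', pa.string()),
    pa.field('mi_to_zcta5', pa.float64())])

itr = pd.read_csv(
    'gaz2016zcta5distancemiles.csv',
    header=0,
    dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
    engine='c',
    chunksize=1_000_000)

for df in itr:
    # Drop the pandas metadata so only field names and types are compared.
    chunk_schema = pa.Table.from_pandas(df, preserve_index=False).schema.remove_metadata()
    assert chunk_schema.equals(schema), chunk_schema
{code}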



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)