[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-06-29 Thread Wes McKinney (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527479#comment-16527479 ]

Wes McKinney commented on ARROW-2372:
-------------------------------------

You can build your own wheels from git master by following the instructions in 
https://github.com/apache/arrow/tree/master/python/manylinux1

I hope we will be able to release Arrow 0.10.0 by the end of July.
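
For reference, a rough sketch of what that build involves (a sketch only: the docker image tag and build script name below are assumptions recalled from that README, not verified commands; check the README itself):

{code:bash}
# Sketch under assumptions: the image tag and script name come from memory
# of the python/manylinux1 README and may differ.
git clone https://github.com/apache/arrow.git
cd arrow/python/manylinux1
# Runs the wheel build inside the manylinux1 image; built wheels should
# appear in a dist/ directory on the host.
docker run --rm -v $PWD:/io \
    quay.io/xhochy/arrow_manylinux1_x86_64_base:latest /io/build_arrow.sh
{code}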

> ArrowIOError: Invalid argument
> ------------------------------
>
>                 Key: ARROW-2372
>                 URL: https://issues.apache.org/jira/browse/ARROW-2372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>         Environment: Ubuntu 16.04
>            Reporter: Kyle Barron
>            Priority: Major
>             Fix For: 0.10.0
>
>
> I get an ArrowIOError when reading a specific file that was also written by 
> pyarrow. Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
>  ---
>  ArrowIOError Traceback (most recent call last)
> <ipython-input-...> in <module>()
> ----> 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
>  62 self.reader = ParquetReader()
>  63 source = _ensure_file(source)
>  ---> 64 self.reader.open(source, metadata=metadata)
>  65 self.common_metadata = common_metadata
>  66 self._nested_paths_by_prefix = self._build_nested_paths()
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Here's a reproducible example with the specific file I'm working with. I'm 
> converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get 
> the source data:
> {code:bash}
> wget https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
> unzip gaz2016zcta5distancemiles.csv.zip
> {code}
> Following the [pyarrow Parquet 
> documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing], 
> the basic idea is to instantiate the writer class, loop over chunks of the 
> csv writing each to parquet, and then close the writer object.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pathlib import Path
> 
> zcta_file = Path('gaz2016zcta5distancemiles.csv')
> 
> itr = pd.read_csv(
>     zcta_file,
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     engine='c',
>     chunksize=64617153)
> 
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
> 
> writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
> 
> print('Starting conversion')
> i = 0
> for df in itr:
>     i += 1
>     print(f'Finished reading csv block {i}')
>     table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
>     writer.write_table(table)
>     print(f'Finished writing parquet block {i}')
> 
> writer.close()
> {code}
> Running this Python script produces the file 
> {{gaz2016zcta5distancemiles.parquet}}, but merely attempting to read the 
> metadata with {{pq.ParquetFile()}} produces the exception above.
> I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would 
> complain on import of the csv if the columns in the data were not 
> {{string}}, {{string}}, and {{float64}}, so I think creating the Parquet 
> schema in that way should be fine.





[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-06-28 Thread Ravi (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526444#comment-16526444 ]

Ravi commented on ARROW-2372:
-------------------------------------

Hi, is there a date when the new version of pyarrow will be released with the 
above fix? We are facing the same problem on 0.9.0. Thank you.






[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-06-19 Thread Wes McKinney (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516896#comment-16516896 ]

Wes McKinney commented on ARROW-2372:
-------------------------------------

The Arrow Python library cannot be installed that way. Refer to the Python 
documentation for instructions on building from source.






[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-06-19 Thread Beatriz (JIRA)


[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516868#comment-16516868 ]

Beatriz commented on ARROW-2372:
-------------------------------------

I got the same issue. Trying to install from Arrow git master with

*pip install git+[https://github.com/apache/arrow.git]*

returns an error:

FileNotFoundError: [Errno 2] No such file or directory: 
'C:\\Users\\user\\AppData\\Local\\Temp\\pip-req-build-x2lu_5ci\\setup.py'

Am I missing something? Thanks
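
That FileNotFoundError is consistent with pip looking for setup.py at the repository root, while pyarrow's setup.py lives under python/. A sketch of pointing pip at that subdirectory (assuming pip's VCS subdirectory syntax; building pyarrow from source this way still needs the Arrow C++ libraries built and discoverable, so a prebuilt wheel is usually easier):

{code:bash}
# Sketch only: the subdirectory fragment points pip at python/setup.py.
# This still requires the Arrow C++ libraries to be installed (e.g. with
# ARROW_HOME set), so it is not a drop-in replacement for a binary wheel.
pip install "git+https://github.com/apache/arrow.git#egg=pyarrow&subdirectory=python"
{code}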






[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-04-16 Thread Kyle Barron (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439885#comment-16439885 ]

Kyle Barron commented on ARROW-2372:


Awesome, thanks!






[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-04-16 Thread Antoine Pitrou (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439884#comment-16439884 ]

Antoine Pitrou commented on ARROW-2372:
---------------------------------------

Ok, I have downloaded the dataset and can confirm that it works on git master.






[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-04-16 Thread Kyle Barron (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439646#comment-16439646 ]

Kyle Barron commented on ARROW-2372:


Sorry, I couldn't figure out how to build Arrow and Parquet. I tried to follow 
[https://github.com/apache/arrow/blob/master/python/doc/source/development.rst] 
with conda exactly, but I get errors. Specifically, I think it's trying to use 
gcc 7.2.0 instead of 4.9. I might just have to wait for 0.9.1.
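
One workaround worth trying, sketched under the assumption that gcc/g++ 4.9 are installed at the paths shown (adjust for your system), is to point CMake at the older toolchain explicitly before configuring Arrow C++:

{code:bash}
# Assumption: gcc-4.9/g++-4.9 exist at these paths; adjust as needed.
# Exporting CC/CXX before the first cmake run makes CMake use them instead
# of whichever newer gcc conda puts on PATH.
export CC=/usr/bin/gcc-4.9
export CXX=/usr/bin/g++-4.9
mkdir -p arrow/cpp/build && cd arrow/cpp/build
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
      -DARROW_PYTHON=ON ..
make -j4 && make install
{code}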






[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-04-16 Thread Antoine Pitrou (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439603#comment-16439603 ]

Antoine Pitrou commented on ARROW-2372:
---------------------------------------

This may have been fixed by ARROW-2369. Would it be possible for you to test 
with Arrow git master?
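
Once a build from master is installed, a quick check along these lines (a sketch; the file name is the one from this report) would confirm which pyarrow the interpreter picks up and whether the failing read now succeeds:

{code:bash}
# Confirm the dev build is the one being imported...
python -c "import pyarrow; print(pyarrow.__version__)"
# ...then retry the metadata read that raised ArrowIOError above.
python -c "import pyarrow.parquet as pq; print(pq.ParquetFile('gaz2016zcta5distancemiles.parquet').metadata)"
{code}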






[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-04-02 Thread Kyle Barron (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422572#comment-16422572 ]

Kyle Barron commented on ARROW-2372:


I edited my code to the script below, which, I believe, writes a parquet file 
from just the first 2GB csv chunk, then from the first two chunks, and so on, 
checking each time that it can open the output. Here's the traceback first: it 
suggests pyarrow could open the Parquet file covering about 6GB of csv data, 
but not the one covering about 8GB.
{code:java}
Starting conversion, up to iteration 0
0.12 minutes
Finished reading csv block 0
0.43 minutes
Finished writing parquet block 0
1.80 minutes
Starting conversion, up to iteration 1
1.80 minutes
Finished reading csv block 0
2.12 minutes
Finished writing parquet block 0
3.49 minutes
Finished reading csv block 1
3.80 minutes
Finished writing parquet block 1
5.19 minutes
Starting conversion, up to iteration 2
5.20 minutes
Finished reading csv block 0
5.52 minutes
Finished writing parquet block 0
6.91 minutes
Finished reading csv block 1
7.22 minutes
Finished writing parquet block 1
8.59 minutes
Finished reading csv block 2
8.92 minutes
Finished writing parquet block 2
10.29 minutes
Starting conversion, up to iteration 3
10.29 minutes
Finished reading csv block 0
10.60 minutes
Finished writing parquet block 0
11.98 minutes
Finished reading csv block 1
12.30 minutes
Finished writing parquet block 1
13.66 minutes
Finished reading csv block 2
13.98 minutes
Finished writing parquet block 2
15.35 minutes
Finished reading csv block 3
15.68 minutes
Finished writing parquet block 3
17.05 minutes
---
ArrowIOError  Traceback (most recent call last)
<ipython-input-...> in <module>()
     29             if j == i:
     30                 writer.close()
---> 31                 pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
     32                 pfs_dict[i] = pf
     33                 break

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in 
__init__(self, source, metadata, common_metadata)
 62 self.reader = ParquetReader()
 63 source = _ensure_file(source)
---> 64 self.reader.open(source, metadata=metadata)
 65 self.common_metadata = common_metadata
 66 self._nested_paths_by_prefix = self._build_nested_paths()

_parquet.pyx in pyarrow._parquet.ParquetReader.open()

error.pxi in pyarrow.lib.check_status()

ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
{code}
And the source code:
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from time import time

t0 = time()

zcta_file = Path('gaz2016zcta5distancemiles.csv')

pfs_dict = {}

for i in range(17):
    itr = pd.read_csv(
        zcta_file,
        header=0,
        dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
        engine='c',
        chunksize=64617153)  # previously determined to be about 2GB of csv data

    msg = f'Starting conversion, up to iteration {i}'
    msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
    print(msg)

    j = 0
    for df in itr:
        msg = f'Finished reading csv block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
        if j == 0:
            writer = pq.ParquetWriter(f'gaz2016zcta5distancemiles_{i}.parquet',
                                      schema=table.schema)

        writer.write_table(table)

        msg = f'Finished writing parquet block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        if j == i:
            writer.close()
            pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
            pfs_dict[i] = pf
            break

        j += 1
{code}


[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

2018-03-30 Thread Kyle Barron (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420984#comment-16420984 ]

Kyle Barron commented on ARROW-2372:


To make sure that the schema creation wasn't the issue, I rewrote the loop to 
be:

{code:python}
i = 0
for df in itr:
    i += 1
    print(f'Finished reading csv block {i}')

    table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
    if i == 1:
        writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet',
                                  schema=table.schema)

    writer.write_table(table)

    print(f'Finished writing parquet block {i}')

writer.close()
{code}

but I still get the same exception when trying to read the metadata as above.



