[jira] [Created] (ARROW-5977) Method for read_csv to limit which columns are read?
Jordan Samuels created ARROW-5977:
-------------------------------------

             Summary: Method for read_csv to limit which columns are read?
                 Key: ARROW-5977
                 URL: https://issues.apache.org/jira/browse/ARROW-5977
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.14.0
            Reporter: Jordan Samuels

In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this in pyarrow.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
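For context, the pandas behaviour being requested can be mimicked with the stdlib csv module (a minimal sketch; the sample data and the `keep` set are illustrative, not part of the report):

```python
import csv
import io

# Sample CSV and the columns we want to keep (illustrative values).
data = "a,b,c\n1,2,3\n4,5,6\n"
keep = {"a", "c"}

# pd.read_csv(usecols=...) drops unwanted columns while parsing;
# here we emulate that by filtering each parsed row dict.
reader = csv.DictReader(io.StringIO(data))
rows = [{k: v for k, v in row.items() if k in keep} for row in reader]
```

The feature request is for pyarrow.csv.read_csv to accept an equivalent option so the unwanted columns never need to be materialized.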
[jira] [Created] (ARROW-5974) read_csv returns truncated read for some valid gzip files
Jordan Samuels created ARROW-5974:
-------------------------------------

             Summary: read_csv returns truncated read for some valid gzip files
                 Key: ARROW-5974
                 URL: https://issues.apache.org/jira/browse/ARROW-5974
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.0
            Reporter: Jordan Samuels

If two gzipped files are concatenated together, the result is a valid gzip file. However, it appears that pyarrow.csv.read_csv will only read the portion related to the first file. If the repro script [here|https://gist.github.com/jordansamuels/d69f1c22c58418f5dfa0785b9ecd211e] is run, the output is:

{{$ python repro.py}}
{{pyarrow.csv only reads one row:}}
{{   x}}
{{0  1}}
{{pandas reads two rows:}}
{{   x}}
{{0  1}}
{{1  2}}
{{pyarrow version: 0.14.0}}
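The claim that concatenation yields a valid gzip file can be checked with the stdlib gzip module, independent of pyarrow (a minimal sketch; the payload bytes are illustrative):

```python
import gzip

# RFC 1952 allows a gzip file to contain multiple members, so
# concatenating two gzip files byte-for-byte yields one valid file.
part1 = gzip.compress(b"x\n1\n")
part2 = gzip.compress(b"2\n")
combined = part1 + part2

# A compliant reader must decompress every member, not just the first;
# gzip.decompress handles multi-member input.
full = gzip.decompress(combined)
```

The reported bug is that pyarrow's decompressor stops after the first member, which is what produces the truncated one-row read above.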
[jira] [Created] (ARROW-1957) Handle nanosecond timestamps in parquet serialization
Jordan Samuels created ARROW-1957:
-------------------------------------

             Summary: Handle nanosecond timestamps in parquet serialization
                 Key: ARROW-1957
                 URL: https://issues.apache.org/jira/browse/ARROW-1957
             Project: Apache Arrow
          Issue Type: Improvement
    Affects Versions: 0.8.0
         Environment: Python 3.6.4, Mac OSX
            Reporter: Jordan Samuels
            Priority: Minor

The following code

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

n = 3
df = pd.DataFrame({'x': range(n)},
                  index=pd.DatetimeIndex(start='2017-01-01', freq='1n', periods=n))
pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet')
{code}

results in:

{{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 14832288001}}

The desired effect is that we can save nanosecond resolution without losing precision (e.g. conversion to ms). Note that if {{freq='1u'}} is used, the code runs properly.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
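The data loss the error guards against can be seen with plain integer arithmetic (a stdlib sketch; the epoch values mirror the repro's 2017-01-01 start and 1 ns frequency):

```python
# 2017-01-01T00:00:00 UTC is 1483228800 s since the Unix epoch.
# Three timestamps 1 ns apart, as produced by freq='1n' in the repro.
ns = [1483228800 * 10**9 + i for i in range(3)]

# A cast from timestamp[ns] to timestamp[us] floor-divides by 1000,
# discarding the sub-microsecond part.
us = [t // 1000 for t in ns]

distinct_ns = len(set(ns))  # three distinct nanosecond timestamps
distinct_us = len(set(us))  # all collapse to one microsecond value
```

This is why the `freq='1u'` variant succeeds: microsecond-spaced values survive the cast, while nanosecond-spaced values would silently merge, so the writer raises instead.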