[jira] [Created] (ARROW-5977) Method for read_csv to limit which columns are read?

2019-07-18 Thread Jordan Samuels (JIRA)
Jordan Samuels created ARROW-5977:

 Summary: Method for read_csv to limit which columns are read?
 Key: ARROW-5977
 URL: https://issues.apache.org/jira/browse/ARROW-5977
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.14.0
Reporter: Jordan Samuels


In pandas there is pd.read_csv(usecols=...), but I can't see a way to do this in pyarrow.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5974) read_csv returns truncated read for some valid gzip files

2019-07-17 Thread Jordan Samuels (JIRA)
Jordan Samuels created ARROW-5974:

 Summary: read_csv returns truncated read for some valid gzip files
 Key: ARROW-5974
 URL: https://issues.apache.org/jira/browse/ARROW-5974
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.0
Reporter: Jordan Samuels


If two gzip files are concatenated, the result is a valid (multi-member) gzip file.  However, it appears that pyarrow.csv.read_csv will only read the portion corresponding to the first member.

If the repro script 
[here|https://gist.github.com/jordansamuels/d69f1c22c58418f5dfa0785b9ecd211e] 
is run, the output is:

{noformat}
$ python repro.py
pyarrow.csv only reads one row:
   x
0  1
pandas reads two rows:
   x
0  1
1  2
pyarrow version: 0.14.0
{noformat}
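For reference, a multi-member gzip file like the one in the repro can be built with the Python standard library alone (a minimal sketch, independent of the linked repro script); {{gzip.decompress}} reads every member, which is the behavior read_csv appears to be missing:

```python
import gzip

# Two gzip members holding halves of a small CSV; concatenating the
# compressed streams yields one valid multi-member gzip file (RFC 1952).
part1 = gzip.compress(b"x\n1\n")
part2 = gzip.compress(b"2\n")
data = part1 + part2

# The standard library decompresses every member, so both rows survive.
assert gzip.decompress(data) == b"x\n1\n2\n"
```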





[jira] [Created] (ARROW-1957) Handle nanosecond timestamps in parquet serialization

2017-12-29 Thread Jordan Samuels (JIRA)
Jordan Samuels created ARROW-1957:

 Summary: Handle nanosecond timestamps in parquet serialization
 Key: ARROW-1957
 URL: https://issues.apache.org/jira/browse/ARROW-1957
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.8.0
 Environment: Python 3.6.4, Mac OSX
Reporter: Jordan Samuels
Priority: Minor


The following code

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

n = 3
# pd.date_range is the supported constructor for this
# (DatetimeIndex(start=..., freq=..., periods=...) was later removed from pandas).
df = pd.DataFrame({'x': range(n)},
                  index=pd.date_range(start='2017-01-01', freq='1n', periods=n))
pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet')
{code}

results in:

{{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 14832288001}}

The desired effect is that we can save nanosecond resolution without losing precision (e.g. through coercion to ms or us).  Note that the code runs properly if {{freq='1u'}} is used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)