[jira] [Updated] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1425:

Fix Version/s: (was: 0.8.0)
   0.9.0

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session-local can 
> be problematic when using pyarrow, which may view the data coming from 
> toPandas() as timezone-naive (but with field values as though they were UTC, 
> not session-local). We should carefully document how to handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]
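[Editor's note] To make the documented pitfall concrete, here is a minimal stdlib-only sketch (not Spark or pyarrow code; the session timezone `America/New_York` is an assumed example) showing how the same naive wall-clock value denotes different instants depending on whether it is interpreted as session-local or as UTC:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A timezone-naive timestamp, as it might come out of toPandas()
naive = datetime(2017, 10, 25, 12, 0, 0)

# Interpreted as UTC (how a consumer of naive data might read it)
as_utc = naive.replace(tzinfo=timezone.utc)

# Interpreted as session-local, the way Spark treats naive timestamps
# (the session timezone here is an assumed example)
as_session = naive.replace(tzinfo=ZoneInfo("America/New_York"))

# Same wall-clock digits, but the instants are four hours apart (EDT is UTC-4)
print(as_session - as_utc)  # 4:00:00
```

The takeaway for documentation: naive timestamps crossing the Spark/Arrow boundary must carry an agreed-upon interpretation, or the same bytes will silently mean different instants.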



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2017-10-25 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219917#comment-16219917
 ] 

Wes McKinney commented on ARROW-1425:
-

It seems there is still too much in flux on the Spark side. Moving this to the 
next milestone.

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session-local can 
> be problematic when using pyarrow, which may view the data coming from 
> toPandas() as timezone-naive (but with field values as though they were UTC, 
> not session-local). We should carefully document how to handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]





[jira] [Commented] (ARROW-1555) [Python] write_to_dataset on s3

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219913#comment-16219913
 ] 

ASF GitHub Bot commented on ARROW-1555:
---

wesm commented on issue #1240: ARROW-1555 [Python] Implement Dask exists 
function
URL: https://github.com/apache/arrow/pull/1240#issuecomment-339542169
 
 
   Build looks OK; the failure is unrelated (I restarted the failing job anyway 
so we can get a green build).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] write_to_dataset on s3
> ---
>
> Key: ARROW-1555
> URL: https://issues.apache.org/jira/browse/ARROW-1555
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Young-Jun Ko
>Assignee: Florian Jetter
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> When writing an Arrow table to S3, I get a NotImplementedError.
> The root cause is in _ensure_filesystem and can be reproduced as follows:
> import pyarrow
> import pyarrow.parquet as pqa
> import s3fs
> s3 = s3fs.S3FileSystem()
> pqa._ensure_filesystem(s3).exists("anything")
> It appears that the S3FSWrapper that is instantiated in _ensure_filesystem 
> does not expose the exists method of s3.
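[Editor's note] The failure mode generalizes: a thin wrapper around a filesystem interface silently loses any method it does not explicitly forward. A stdlib-only sketch (class names here are hypothetical, not the actual pyarrow/s3fs classes) of the bug and the usual `__getattr__` delegation fix:

```python
class FakeS3:
    # Hypothetical stand-in for s3fs.S3FileSystem
    def exists(self, path):
        return path == "present"

class NarrowWrapper:
    """Forwards only what it explicitly defines -- exists() is lost."""
    def __init__(self, fs):
        self.fs = fs
    def isdir(self, path):
        return False

class DelegatingWrapper(NarrowWrapper):
    """Falls back to the wrapped filesystem for any undefined attribute."""
    def __getattr__(self, name):
        return getattr(self.fs, name)

fs = FakeS3()
print(hasattr(NarrowWrapper(fs), "exists"))      # False: method swallowed
print(DelegatingWrapper(fs).exists("present"))   # True: delegated through
```

The delegating variant keeps the wrapper's own overrides while transparently exposing everything else the underlying filesystem provides.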





[jira] [Commented] (ARROW-1555) [Python] write_to_dataset on s3

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219911#comment-16219911
 ] 

ASF GitHub Bot commented on ARROW-1555:
---

wesm commented on a change in pull request #1240: ARROW-1555 [Python] Implement 
Dask exists function
URL: https://github.com/apache/arrow/pull/1240#discussion_r147039732
 
 

 ##
 File path: python/pyarrow/filesystem.py
 ##
 @@ -135,6 +135,12 @@ def isfile(self, path):
 """
 raise NotImplementedError
 
+def isfilestore(self):
 
 Review comment:
   Can you make this a private API (`_isfilestore`)? Unclear if normal users 
would need this




> [Python] write_to_dataset on s3
> ---
>
> Key: ARROW-1555
> URL: https://issues.apache.org/jira/browse/ARROW-1555
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Young-Jun Ko
>Assignee: Florian Jetter
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> When writing an Arrow table to S3, I get a NotImplementedError.
> The root cause is in _ensure_filesystem and can be reproduced as follows:
> import pyarrow
> import pyarrow.parquet as pqa
> import s3fs
> s3 = s3fs.S3FileSystem()
> pqa._ensure_filesystem(s3).exists("anything")
> It appears that the S3FSWrapper that is instantiated in _ensure_filesystem 
> does not expose the exists method of s3.





[jira] [Updated] (ARROW-1133) [C++] Convert all non-accessor function names to PascalCase

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1133:

Fix Version/s: (was: 0.8.0)
   1.0.0

> [C++] Convert all non-accessor function names to PascalCase
> ---
>
> Key: ARROW-1133
> URL: https://issues.apache.org/jira/browse/ARROW-1133
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 1.0.0
>
>
> It seems Google has taken the "cheap functions can be lower case" rule out of 
> their style guide. I've been asked enough about "which style to use" that I 
> like the idea of UsePascalCaseForEverything: 
> https://github.com/google/styleguide/commit/db0a26320f3e930c6ea7225ed53539b4fb31310c#diff-26120df7bca3279afbf749017c778545R4277





[jira] [Commented] (ARROW-1728) [C++] Run clang-format checks in Travis CI

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219901#comment-16219901
 ] 

ASF GitHub Bot commented on ARROW-1728:
---

wesm commented on issue #1251: ARROW-1728: [C++] Run clang-format checks in 
Travis CI
URL: https://github.com/apache/arrow/pull/1251#issuecomment-339540207
 
 
   Alright, we are looking good:
   
   ```
   Scanning dependencies of target check-format
   clang-format checks failed, run 'make format' to fix
   make[3]: *** [CMakeFiles/check-format] Error 255
   make[2]: *** [CMakeFiles/check-format.dir/all] Error 2
   make[1]: *** [CMakeFiles/check-format.dir/rule] Error 2
   make: *** [check-format] Error 2
   ```
   
   I'll revert the flake, and if others are in agreement about failing on 
clang-format issues we can merge this.




> [C++] Run clang-format checks in Travis CI
> --
>
> Key: ARROW-1728
> URL: https://issues.apache.org/jira/browse/ARROW-1728
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> I think it's reasonable to expect contributors to run clang-format on their 
> code. This may lead to a higher number of failed builds but will eliminate 
> noise diffs in unrelated patches





[jira] [Updated] (ARROW-1646) [Python] pyarrow.array cannot handle NumPy scalar types

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1646:

Fix Version/s: (was: 0.8.0)
   0.9.0

> [Python] pyarrow.array cannot handle NumPy scalar types
> ---
>
> Key: ARROW-1646
> URL: https://issues.apache.org/jira/browse/ARROW-1646
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.9.0
>
>
> Example repro
> {code}
> In [1]: import pyarrow as pa
> In [2]: import numpy as np
> In [3]: pa.array([np.random.randint(0, 10, size=5), None])
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in ()
> > 1 pa.array([np.random.randint(0, 10, size=5), None])
> /home/wesm/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array 
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:24892)()
> 171 if mask is not None:
> 172 raise ValueError("Masks only supported with ndarray-like 
> inputs")
> --> 173 return _sequence_to_array(obj, size, type, pool)
> 174 
> 175 
> /home/wesm/code/arrow/python/pyarrow/array.pxi in 
> pyarrow.lib._sequence_to_array 
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:23496)()
>  23 if type is None:
>  24 with nogil:
> ---> 25 check_status(ConvertPySequence(sequence, pool, &out))
>  26 else:
>  27 if size is None:
> /home/wesm/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status 
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:7876)()
>  75 message = frombytes(status.message())
>  76 if status.IsInvalid():
> ---> 77 raise ArrowInvalid(message)
>  78 elif status.IsIOError():
>  79 raise ArrowIOError(message)
> ArrowInvalid: 
> /home/wesm/code/arrow/cpp/src/arrow/python/builtin_convert.cc:740 code: 
> InferArrowTypeAndSize(obj, &size, &type)
> /home/wesm/code/arrow/cpp/src/arrow/python/builtin_convert.cc:319 code: 
> InferArrowType(obj, out_type)
> /home/wesm/code/arrow/cpp/src/arrow/python/builtin_convert.cc:299 code: 
> seq_visitor.Visit(obj)
> /home/wesm/code/arrow/cpp/src/arrow/python/builtin_convert.cc:180 code: 
> VisitElem(ref, level)
> Error inferring Arrow data type for collection of Python objects. Got Python 
> object of type ndarray but can only handle these types: bool, float, integer, 
> date, datetime, bytes, unicode
> {code}
> If these inner values are converted to Python built-in int types then it 
> works fine
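[Editor's note] As a stdlib-only illustration of the failure mode (this mimics, rather than reproduces, Arrow's C++ type inference; `infer_type` and `FakeScalar` are hypothetical names invented for the sketch): inference that only recognizes Python builtin scalars rejects foreign scalar types until they are coerced, which is exactly the workaround noted above.

```python
def infer_type(values):
    # Hypothetical stand-in for Arrow's sequence type inference, which at
    # the time recognized only Python builtin scalar types
    for v in values:
        if v is None:
            continue
        if isinstance(v, bool):   # bool before int: bool subclasses int
            return "bool"
        if isinstance(v, int):
            return "int64"
        raise TypeError("cannot infer type for " + type(v).__name__)
    return "null"

class FakeScalar:
    # Stands in for a NumPy scalar: int-convertible but not a builtin int
    def __init__(self, value):
        self.value = value
    def __int__(self):
        return self.value

values = [FakeScalar(7), None]
try:
    infer_type(values)
except TypeError as exc:
    print(exc)  # cannot infer type for FakeScalar

# The workaround from the report: coerce to builtin ints before inference
coerced = [int(v) if v is not None else None for v in values]
print(infer_type(coerced))  # int64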



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1732) [Python] RecordBatch.from_pandas fails on DataFrame with no columns when preserve_index=False

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219886#comment-16219886
 ] 

ASF GitHub Bot commented on ARROW-1732:
---

wesm opened a new pull request #1252: ARROW-1732: [Python] Permit creating 
record batches with no columns, test pandas roundtrips
URL: https://github.com/apache/arrow/pull/1252
 
 
   I ran into this rough edge today. Serialization code paths will invariably 
need to send across a DataFrame with no columns, so this needs to work even if 
`preserve_index=False`.




> [Python] RecordBatch.from_pandas fails on DataFrame with no columns when 
> preserve_index=False
> -
>
> Key: ARROW-1732
> URL: https://issues.apache.org/jira/browse/ARROW-1732
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> I believe this should have well-defined behavior and not raise an error:
> {code}
> In [5]: pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)
> ---
> ValueErrorTraceback (most recent call last)
>  in ()
> > 1 pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_pandas 
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:39957)()
> 586 df, schema, preserve_index, nthreads=nthreads
> 587 )
> --> 588 return cls.from_arrays(arrays, names, metadata)
> 589 
> 590 @staticmethod
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays 
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:40130)()
> 615 
> 616 if not number_of_arrays:
> --> 617 raise ValueError('Record batch cannot contain no arrays 
> (for now)')
> 618 
> 619 num_rows = len(arrays[0])
> ValueError: Record batch cannot contain no arrays (for now)
> {code}
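[Editor's note] The fix amounts to not inferring the row count from the first array when there are no arrays. A stdlib-only sketch of the guard (hypothetical function and return shape, not the actual pyarrow implementation):

```python
def make_batch(arrays, names, num_rows=None):
    # Hypothetical sketch: with no arrays we cannot take len(arrays[0]),
    # so default the row count to 0 instead of raising -- this lets
    # zero-column DataFrames round-trip with well-defined behavior
    if len(arrays) != len(names):
        raise ValueError("arrays and names must have the same length")
    if num_rows is None:
        num_rows = len(arrays[0]) if arrays else 0
    return {"names": list(names), "num_rows": num_rows}

print(make_batch([], []))            # {'names': [], 'num_rows': 0}
print(make_batch([[1, 2]], ["a"]))   # {'names': ['a'], 'num_rows': 2}
```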





[jira] [Updated] (ARROW-1732) [Python] RecordBatch.from_pandas fails on DataFrame with no columns when preserve_index=False

2017-10-25 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1732:
--
Labels: pull-request-available  (was: )

> [Python] RecordBatch.from_pandas fails on DataFrame with no columns when 
> preserve_index=False
> -
>
> Key: ARROW-1732
> URL: https://issues.apache.org/jira/browse/ARROW-1732
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> I believe this should have well-defined behavior and not raise an error:
> {code}
> In [5]: pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)
> ---
> ValueErrorTraceback (most recent call last)
>  in ()
> > 1 pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_pandas 
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:39957)()
> 586 df, schema, preserve_index, nthreads=nthreads
> 587 )
> --> 588 return cls.from_arrays(arrays, names, metadata)
> 589 
> 590 @staticmethod
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays 
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:40130)()
> 615 
> 616 if not number_of_arrays:
> --> 617 raise ValueError('Record batch cannot contain no arrays 
> (for now)')
> 618 
> 619 num_rows = len(arrays[0])
> ValueError: Record batch cannot contain no arrays (for now)
> {code}





[jira] [Assigned] (ARROW-1732) [Python] RecordBatch.from_pandas fails on DataFrame with no columns when preserve_index=False

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1732:
---

Assignee: Wes McKinney

> [Python] RecordBatch.from_pandas fails on DataFrame with no columns when 
> preserve_index=False
> -
>
> Key: ARROW-1732
> URL: https://issues.apache.org/jira/browse/ARROW-1732
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> I believe this should have well-defined behavior and not raise an error:
> {code}
> In [5]: pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)
> ---
> ValueErrorTraceback (most recent call last)
>  in ()
> > 1 pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_pandas 
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:39957)()
> 586 df, schema, preserve_index, nthreads=nthreads
> 587 )
> --> 588 return cls.from_arrays(arrays, names, metadata)
> 589 
> 590 @staticmethod
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays 
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:40130)()
> 615 
> 616 if not number_of_arrays:
> --> 617 raise ValueError('Record batch cannot contain no arrays 
> (for now)')
> 618 
> 619 num_rows = len(arrays[0])
> ValueError: Record batch cannot contain no arrays (for now)
> {code}





[jira] [Updated] (ARROW-842) [Python] Handle more kinds of null sentinel objects from pandas 0.x

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-842:
---
Fix Version/s: (was: 0.8.0)
   0.9.0

> [Python] Handle more kinds of null sentinel objects from pandas 0.x
> ---
>
> Key: ARROW-842
> URL: https://issues.apache.org/jira/browse/ARROW-842
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> Follow-on work to ARROW-707. See 
> https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L193 
> and discussion in https://github.com/apache/arrow/pull/554
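[Editor's note] For context, pandas 0.x represented nulls with several sentinel objects (None, float NaN, NaT, and various NumPy scalar NaNs). A stdlib-only sketch of the kind of check involved (the function name is hypothetical; the real logic lives in Arrow's C++ `PandasObjectIsNull` and the linked pandas `lib.pyx`):

```python
import math

def is_null_sentinel(obj):
    # None and float NaN are the sentinels expressible with only the
    # stdlib; the real check must also handle pandas NaT, Decimal("nan"),
    # and NumPy float scalars
    if obj is None:
        return True
    if isinstance(obj, float) and math.isnan(obj):
        return True
    return False

print([is_null_sentinel(x) for x in [None, float("nan"), 0.0, "x"]])
# [True, True, False, False]
```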





[jira] [Commented] (ARROW-842) [Python] Handle more kinds of null sentinel objects from pandas 0.x

2017-10-25 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219873#comment-16219873
 ] 

Wes McKinney commented on ARROW-842:


This might wait for more general tooling around NumPy scalar types. See also 
ARROW-1646

> [Python] Handle more kinds of null sentinel objects from pandas 0.x
> ---
>
> Key: ARROW-842
> URL: https://issues.apache.org/jira/browse/ARROW-842
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> Follow-on work to ARROW-707. See 
> https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L193 
> and discussion in https://github.com/apache/arrow/pull/554





[jira] [Resolved] (ARROW-1524) [C++] More graceful solution for handling non-zero offsets on inputs and outputs in compute library

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1524.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/54d5c81af0a9cbc6ea551922c795728cd43bd86c

> [C++] More graceful solution for handling non-zero offsets on inputs and 
> outputs in compute library
> ---
>
> Key: ARROW-1524
> URL: https://issues.apache.org/jira/browse/ARROW-1524
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> Currently we must remember to shift by the offset. We should add some inline 
> utility functions to centralize this logic.





[jira] [Resolved] (ARROW-1482) [C++] Implement casts between date32 and date64

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1482.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/54d5c81af0a9cbc6ea551922c795728cd43bd86c

> [C++] Implement casts between date32 and date64
> ---
>
> Key: ARROW-1482
> URL: https://issues.apache.org/jira/browse/ARROW-1482
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>






[jira] [Resolved] (ARROW-1672) [Python] Failure to write Feather bytes column

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1672.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/238881fae8530a1ae994eb0e283e4783d3dd2855

> [Python] Failure to write Feather bytes column
> --
>
> Key: ARROW-1672
> URL: https://issues.apache.org/jira/browse/ARROW-1672
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> See bug report in https://github.com/wesm/feather/issues/320





[jira] [Resolved] (ARROW-1680) [Python] Timestamp unit change not done in from_pandas() conversion

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1680.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/54d5c81af0a9cbc6ea551922c795728cd43bd86c

> [Python] Timestamp unit change not done in from_pandas() conversion
> ---
>
> Key: ARROW-1680
> URL: https://issues.apache.org/jira/browse/ARROW-1680
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Bryan Cutler
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> Calling {{Array.from_pandas}} with a pandas.Series of timestamps that have 
> 'ns' unit, while specifying a type with 'us' unit to coerce to, causes 
> problems. When the series has timestamps with a timezone, the unit is 
> ignored. When the series does not have a timezone, the unit is applied but 
> causes an OverflowError when printing.
> {noformat}
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> from datetime import datetime
> >>> s = pd.Series([datetime.now()])
> >>> s_nyc = s.dt.tz_localize('tzlocal()').dt.tz_convert('America/New_York')
> >>> arr = pa.Array.from_pandas(s_nyc, type=pa.timestamp('us', 
> >>> tz='America/New_York'))
> >>> arr.type
> TimestampType(timestamp[ns, tz=America/New_York])
> >>> arr = pa.Array.from_pandas(s, type=pa.timestamp('us'))
> >>> arr.type
> TimestampType(timestamp[us])
> >>> print(arr)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ 
> (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221)
> values = array_format(self, window=10)
>   File "pyarrow/formatting.py", line 28, in array_format
> values.append(value_format(x, 0))
>   File "pyarrow/formatting.py", line 49, in value_format
> return repr(x)
>   File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ 
> (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535)
> return repr(self.as_py())
>   File "pyarrow/scalar.pxi", line 240, in pyarrow.lib.TimestampValue.as_py 
> (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:21600)
> return converter(value, tzinfo=tzinfo)
>   File "pyarrow/scalar.pxi", line 204, in pyarrow.lib.lambda5 
> (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:7295)
> TimeUnit_MICRO: lambda x, tzinfo: pd.Timestamp(
>   File "pandas/_libs/tslib.pyx", line 402, in 
> pandas._libs.tslib.Timestamp.__new__ (pandas/_libs/tslib.c:10051)
>   File "pandas/_libs/tslib.pyx", line 1467, in 
> pandas._libs.tslib.convert_to_tsobject (pandas/_libs/tslib.c:27665)
> OverflowError: Python int too large to convert to C long
> {noformat}
> A workaround is to manually change values with astype
> {noformat}
> >>> arr = pa.Array.from_pandas(s.values.astype('datetime64[us]'))
> >>> arr.type
> TimestampType(timestamp[us])
> >>> print(arr)
> 
> [
>   Timestamp('2017-10-17 11:04:44.308233')
> ]
> >>> 
> {noformat}
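[Editor's note] The OverflowError is a units error: a nanosecond-magnitude integer read as microseconds points roughly 1000x too far into the future for Python's datetime range. A stdlib-only sketch of the arithmetic (the epoch value below is an assumed example chosen to match the timestamp in the report):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# ~2017-10-17 11:04:44.308233 UTC, in nanoseconds since the epoch
# (an assumed example value)
ns_value = 1_508_238_284_308_233_000

# Correct ns -> us coercion: integer-divide by 1000
us_value = ns_value // 1000
print(EPOCH + timedelta(microseconds=us_value))
# 2017-10-17 11:04:44.308233+00:00

# Mislabeling the ns value as microseconds lands ~47,000 years in the
# future, past datetime.max (year 9999), hence the OverflowError
try:
    EPOCH + timedelta(microseconds=ns_value)
except OverflowError as exc:
    print("overflow:", exc)
```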





[jira] [Resolved] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1675.
-
Resolution: Fixed

Resolved by PR 
https://github.com/apache/arrow/commit/238881fae8530a1ae994eb0e283e4783d3dd2855

> [Python] Use RecordBatch.from_pandas in FeatherWriter.write
> ---
>
> Key: ARROW-1675
> URL: https://issues.apache.org/jira/browse/ARROW-1675
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> In addition to making the implementation simpler, we will also benefit from 
> multithreaded conversions, and thus faster write speeds.





[jira] [Commented] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219863#comment-16219863
 ] 

ASF GitHub Bot commented on ARROW-1675:
---

wesm closed pull request #1250: ARROW-1675: [Python] Use 
RecordBatch.from_pandas in Feather write path
URL: https://github.com/apache/arrow/pull/1250
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/python/pyarrow/feather.py b/python/pyarrow/feather.py
index 2091c9154..3ba9d652c 100644
--- a/python/pyarrow/feather.py
+++ b/python/pyarrow/feather.py
@@ -23,7 +23,7 @@
 
 from pyarrow.compat import pdapi
 from pyarrow.lib import FeatherError  # noqa
-from pyarrow.lib import Table
+from pyarrow.lib import RecordBatch, Table
 import pyarrow.lib as ext
 
 try:
@@ -75,30 +75,12 @@ def write(self, df):
 if not df.columns.is_unique:
 raise ValueError("cannot serialize duplicate column names")
 
-# TODO(wesm): pipeline conversion to Arrow memory layout
-for i, name in enumerate(df.columns):
-col = df.iloc[:, i]
-
-if pdapi.is_object_dtype(col):
-inferred_type = infer_dtype(col)
-msg = ("cannot serialize column {n} "
-   "named {name} with dtype {dtype}".format(
-   n=i, name=name, dtype=inferred_type))
-
-if inferred_type in ['mixed']:
-
-# allow columns with nulls + an inferable type
-inferred_type = infer_dtype(col[col.notnull()])
-if inferred_type in ['mixed']:
-raise ValueError(msg)
-
-elif inferred_type not in ['unicode', 'string']:
-raise ValueError(msg)
-
-if not isinstance(name, six.string_types):
-name = str(name)
-
-self.writer.write_array(name, col)
+# TODO(wesm): Remove this length check, see ARROW-1732
+if len(df.columns) > 0:
+batch = RecordBatch.from_pandas(df, preserve_index=False)
+for i, name in enumerate(batch.schema.names):
+col = batch[i]
+self.writer.write_array(name, col)
 
 self.writer.close()
 
diff --git a/python/pyarrow/tests/test_feather.py 
b/python/pyarrow/tests/test_feather.py
index 810ee3c8c..9e7fc8863 100644
--- a/python/pyarrow/tests/test_feather.py
+++ b/python/pyarrow/tests/test_feather.py
@@ -279,11 +279,14 @@ def test_delete_partial_file_on_error(self):
 if sys.platform == 'win32':
 pytest.skip('Windows hangs on to file handle for some reason')
 
+class CustomClass(object):
+pass
+
 # strings will fail
 df = pd.DataFrame(
 {
 'numbers': range(5),
-'strings': [b'foo', None, u'bar', 'qux', np.nan]},
+'strings': [b'foo', None, u'bar', CustomClass(), np.nan]},
 columns=['numbers', 'strings'])
 
 path = random_path()
@@ -297,10 +300,13 @@ def test_delete_partial_file_on_error(self):
 def test_strings(self):
 repeats = 1000
 
-# we hvae mixed bytes, unicode, strings
+# Mixed bytes, unicode, strings coerced to binary
 values = [b'foo', None, u'bar', 'qux', np.nan]
 df = pd.DataFrame({'strings': values * repeats})
-self._assert_error_on_write(df, ValueError)
+
+ex_values = [b'foo', None, b'bar', b'qux', np.nan]
+expected = pd.DataFrame({'strings': ex_values * repeats})
+self._check_pandas_roundtrip(df, expected, null_counts=[2 * repeats])
 
 # embedded nulls are ok
 values = ['foo', None, 'bar', 'qux', None]
diff --git a/python/pyarrow/types.pxi b/python/pyarrow/types.pxi
index 686e56ead..c9a490960 100644
--- a/python/pyarrow/types.pxi
+++ b/python/pyarrow/types.pxi
@@ -662,7 +662,6 @@ cdef _as_type(type):
 return type_for_alias(type)
 
 
-
 cdef set PRIMITIVE_TYPES = set([
 _Type_NA, _Type_BOOL,
 _Type_UINT8, _Type_INT8,


 




> [Python] Use RecordBatch.from_pandas in FeatherWriter.write
> ---
>
> Key: ARROW-1675
> URL: https://issues.apache.org/jira/browse/ARROW-1675
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>R

[jira] [Resolved] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1721.
-
Resolution: Fixed

Resolved by PR 
https://github.com/apache/arrow/commit/48a6ff856cf4de939f5ced42a09b1b39866efc1e

> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375
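[Editor's note] A stdlib-only sketch of the mask-aware conversion pattern this issue asks for (the function is hypothetical; the real code is C++ in numpy_to_arrow.cc): an explicit boolean null mask is consulted before the value itself, so masked-out entries become nulls regardless of their payload.

```python
def convert_with_mask(values, mask=None):
    # Mirrors the mask-aware loop: a True mask entry forces a null,
    # otherwise fall back to inspecting the value itself
    out = []
    for i, value in enumerate(values):
        if (mask is not None and mask[i]) or value is None:
            out.append(None)        # builder.AppendNull()
        else:
            out.append(int(value))  # builder.Append(...)
    return out

print(convert_with_mask([1, 2, 3], mask=[False, True, False]))  # [1, None, 3]
print(convert_with_mask([1, None, 3]))                          # [1, None, 3]
```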





[jira] [Commented] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219860#comment-16219860
 ] 

ASF GitHub Bot commented on ARROW-1675:
---

wesm commented on issue #1250: ARROW-1675: [Python] Use RecordBatch.from_pandas 
in Feather write path
URL: https://github.com/apache/arrow/pull/1250#issuecomment-339530622
 
 
   +1




> [Python] Use RecordBatch.from_pandas in FeatherWriter.write
> ---
>
> Key: ARROW-1675
> URL: https://issues.apache.org/jira/browse/ARROW-1675
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> In addition to making the implementation simpler, we will also benefit from 
> multithreaded conversions, and thus faster write speeds.





[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219857#comment-16219857
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

wesm closed pull request #1246: ARROW-1721: [Python] Implement null-mask check 
in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc 
b/cpp/src/arrow/python/numpy_to_arrow.cc
index 2c89a9f61..ead3a0481 100644
--- a/cpp/src/arrow/python/numpy_to_arrow.cc
+++ b/cpp/src/arrow/python/numpy_to_arrow.cc
@@ -622,8 +622,12 @@ Status NumPyConverter::ConvertDates() {
 
   Ndarray1DIndexer<PyObject*> objects(arr_);
 
+  Ndarray1DIndexer<uint8_t> mask_values;
+
+  bool have_mask = false;
   if (mask_ != nullptr) {
-    return Status::NotImplemented("mask not supported in object conversions yet");
+    mask_values.Init(mask_);
+    have_mask = true;
   }
 
   BuilderType builder(pool_);
@@ -636,10 +640,10 @@ Status NumPyConverter::ConvertDates() {
   PyObject* obj;
   for (int64_t i = 0; i < length_; ++i) {
     obj = objects[i];
-    if (PyDate_CheckExact(obj)) {
-      RETURN_NOT_OK(builder.Append(UnboxDate<ArrowType>::Unbox(obj)));
-    } else if (PandasObjectIsNull(obj)) {
+    if ((have_mask && mask_values[i]) || PandasObjectIsNull(obj)) {
       RETURN_NOT_OK(builder.AppendNull());
+    } else if (PyDate_CheckExact(obj)) {
+      RETURN_NOT_OK(builder.Append(UnboxDate<ArrowType>::Unbox(obj)));
     } else {
       std::stringstream ss;
       ss << "Error converting from Python objects to Date: ";
@@ -1029,6 +1033,41 @@ Status LoopPySequence(PyObject* sequence, T func) {
   return Status::OK();
 }
 
+template <typename T>
+Status LoopPySequenceWithMasks(PyObject* sequence,
+                               const Ndarray1DIndexer<uint8_t>& mask_values,
+                               bool have_mask, T func) {
+  if (PySequence_Check(sequence)) {
+    OwnedRef ref;
+    Py_ssize_t size = PySequence_Size(sequence);
+    if (PyArray_Check(sequence)) {
+      auto array = reinterpret_cast<PyArrayObject*>(sequence);
+      Ndarray1DIndexer<PyObject*> objects(array);
+      for (int64_t i = 0; i < size; ++i) {
+        RETURN_NOT_OK(func(objects[i], have_mask && mask_values[i]));
+      }
+    } else {
+      for (int64_t i = 0; i < size; ++i) {
+        ref.reset(PySequence_GetItem(sequence, i));
+        RETURN_NOT_OK(func(ref.obj(), have_mask && mask_values[i]));
+      }
+    }
+  } else if (PyObject_HasAttrString(sequence, "__iter__")) {
+    OwnedRef iter = OwnedRef(PyObject_GetIter(sequence));
+    PyObject* item;
+    int64_t i = 0;
+    while ((item = PyIter_Next(iter.obj()))) {
+      OwnedRef ref = OwnedRef(item);
+      RETURN_NOT_OK(func(ref.obj(), have_mask && mask_values[i]));
+      i++;
+    }
+  } else {
+    return Status::TypeError("Object is not a sequence or iterable");
+  }
+
+  return Status::OK();
+}
+
 template <int ITEM_TYPE, typename ArrowType>
 inline Status NumPyConverter::ConvertTypedLists(const std::shared_ptr<DataType>& type,
                                                 ListBuilder* builder, PyObject* list) {
@@ -1037,15 +1076,18 @@ inline Status NumPyConverter::ConvertTypedLists(const std::shared_ptr<DataType>&
 
   PyAcquireGIL lock;
 
-  // TODO: mask not supported here
+  Ndarray1DIndexer<uint8_t> mask_values;
+
+  bool have_mask = false;
   if (mask_ != nullptr) {
-    return Status::NotImplemented("mask not supported in object conversions yet");
+    mask_values.Init(mask_);
+    have_mask = true;
   }
 
   BuilderT* value_builder = static_cast<BuilderT*>(builder->value_builder());
 
-  auto foreach_item = [&](PyObject* object) {
-    if (PandasObjectIsNull(object)) {
+  auto foreach_item = [&](PyObject* object, bool mask) {
+    if (mask || PandasObjectIsNull(object)) {
       return builder->AppendNull();
     } else if (PyArray_Check(object)) {
       auto numpy_array = reinterpret_cast<PyArrayObject*>(object);
@@ -1071,7 +1113,7 @@ inline Status NumPyConverter::ConvertTypedLists(const std::shared_ptr<DataType>&
     }
   };
 
-  return LoopPySequence(list, foreach_item);
+  return LoopPySequenceWithMasks(list, mask_values, have_mask, foreach_item);
 }
 
 template <>
@@ -1079,15 +1121,18 @@ inline Status NumPyConverter::ConvertTypedLists<NPY_OBJECT, StringType>(
     const std::shared_ptr<DataType>& type, ListBuilder* builder, PyObject* list) {
   PyAcquireGIL lock;
 
-  // TODO: mask not supported here
+  Ndarray1DIndexer<uint8_t> mask_values;
+
+  bool have_mask = false;
   if (mask_ != nullptr) {
-    return Status::NotImplemented("mask not supported in object conversions yet");
+    mask_values.Init(mask_);
+    have_mask = true;
   }
 
   auto value_builder = static_cast<StringBuilder*>(builder->value_builder());
 
-  auto foreach_item = [&](PyObject* object) {
-if (PandasObjectIsN

[jira] [Resolved] (ARROW-1484) [C++] Implement (safe and unsafe) casts between timestamps and times of different units

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1484.
-
Resolution: Fixed

Resolved by PR 
https://github.com/apache/arrow/commit/54d5c81af0a9cbc6ea551922c795728cd43bd86c

> [C++] Implement (safe and unsafe) casts between timestamps and times of 
> different units
> ---
>
> Key: ARROW-1484
> URL: https://issues.apache.org/jira/browse/ARROW-1484
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>






[jira] [Commented] (ARROW-1484) [C++] Implement (safe and unsafe) casts between timestamps and times of different units

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219848#comment-16219848
 ] 

ASF GitHub Bot commented on ARROW-1484:
---

wesm closed pull request #1245: ARROW-1484: [C++/Python] Implement casts 
between date, time, timestamp units
URL: https://github.com/apache/arrow/pull/1245
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/compute/cast.cc b/cpp/src/arrow/compute/cast.cc
index e8bbfd347..68a2b1237 100644
--- a/cpp/src/arrow/compute/cast.cc
+++ b/cpp/src/arrow/compute/cast.cc
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "arrow/array.h"
 #include "arrow/buffer.h"
@@ -68,6 +69,24 @@
 namespace arrow {
 namespace compute {
 
+template <typename T>
+inline const T* GetValuesAs(const ArrayData& data, int i) {
+  return reinterpret_cast<const T*>(data.buffers[i]->data()) + data.offset;
+}
+
+namespace {
+
+void CopyData(const Array& input, ArrayData* output) {
+  auto in_data = input.data();
+  output->length = in_data->length;
+  output->null_count = input.null_count();
+  output->buffers = in_data->buffers;
+  output->offset = in_data->offset;
+  output->child_data = in_data->child_data;
+}
+
+}  // namespace
+
 // --
 // Zero copy casts
 
@@ -77,7 +96,9 @@ struct is_zero_copy_cast {
 };
 
 template 
-struct is_zero_copy_cast::value>::type> {
+struct is_zero_copy_cast<
+O, I, typename std::enable_if::value &&
+  !std::is_base_of::value>::type> {
   static constexpr bool value = true;
 };
 
@@ -102,10 +123,7 @@ template <typename O, typename I>
 struct CastFunctor<O, I, typename std::enable_if<is_zero_copy_cast<O, I>::value>::type> {
   void operator()(FunctionContext* ctx, const CastOptions& options, const Array& input,
                   ArrayData* output) {
-    auto in_data = input.data();
-    output->null_count = input.null_count();
-    output->buffers = in_data->buffers;
-    output->child_data = in_data->child_data;
+    CopyData(input, output);
   }
 };
 
@@ -119,6 +137,7 @@ struct CastFunctor
     auto buf = output->buffers[1];
+    DCHECK_EQ(output->offset, 0);
     memset(buf->mutable_data(), 0, buf->size());
   }
 };
@@ -139,12 +158,16 @@ struct CastFunctor
-    const uint8_t* data = input.data()->buffers[1]->data();
-    auto out = reinterpret_cast<c_type*>(output->buffers[1]->mutable_data());
     constexpr auto kOne = static_cast<c_type>(1);
     constexpr auto kZero = static_cast<c_type>(0);
+
+    auto in_data = input.data();
+    internal::BitmapReader bit_reader(in_data->buffers[1]->data(), in_data->offset,
+                                      in_data->length);
+    auto out = reinterpret_cast<c_type*>(output->buffers[1]->mutable_data());
     for (int64_t i = 0; i < input.length(); ++i) {
-      *out++ = BitUtil::GetBit(data, i) ? kOne : kZero;
+      *out++ = bit_reader.IsSet() ? kOne : kZero;
+      bit_reader.Next();
     }
   }
 };
@@ -189,7 +212,9 @@ struct CastFunctor::v
   void operator()(FunctionContext* ctx, const CastOptions& options, const 
Array& input,
   ArrayData* output) {
     using in_type = typename I::c_type;
-    auto in_data = reinterpret_cast<const in_type*>(input.data()->buffers[1]->data());
+    DCHECK_EQ(output->offset, 0);
+
+    const in_type* in_data = GetValuesAs<in_type>(*input.data(), 1);
     uint8_t* out_data = reinterpret_cast<uint8_t*>(output->buffers[1]->mutable_data());
     for (int64_t i = 0; i < input.length(); ++i) {
       BitUtil::SetBitTo(out_data, i, (*in_data++) != 0);
@@ -204,12 +229,11 @@ struct CastFunctor
+    DCHECK_EQ(output->offset, 0);
 
     auto in_offset = input.offset();
 
-    const auto& input_buffers = input.data()->buffers;
-
-    auto in_data = reinterpret_cast<const in_type*>(input_buffers[1]->data()) + in_offset;
+    const in_type* in_data = GetValuesAs<in_type>(*input.data(), 1);
     auto out_data = reinterpret_cast<out_type*>(output->buffers[1]->mutable_data());
 
     if (!options.allow_int_overflow) {
@@ -217,14 +241,15 @@ struct CastFunctor
       constexpr auto kMin = static_cast<in_type>(std::numeric_limits<out_type>::min());
 
       if (input.null_count() > 0) {
-        const uint8_t* is_valid = input_buffers[0]->data();
-        int64_t is_valid_offset = in_offset;
+        internal::BitmapReader is_valid_reader(input.data()->buffers[0]->data(),
+                                               in_offset, input.length());
         for (int64_t i = 0; i < input.length(); ++i) {
-          if (ARROW_PREDICT_FALSE(BitUtil::GetBit(is_valid, is_valid_offset++) &&
+          if (ARROW_PREDICT_FALSE(is_valid_reader.IsSet() &&
                                   (*in_data > kMax || *in_data < kMin))) {
             ctx->SetStatus(Status::Invalid("Integer value out of bounds"));
           }
           *out_data++ = static_cast<out_type>(*in_data++);
+          is_valid_reader.Next();
         }
   } else {
 for (int64_t i = 0; i < i

[jira] [Commented] (ARROW-587) Add JIRA fix version to merge tool

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219850#comment-16219850
 ] 

ASF GitHub Bot commented on ARROW-587:
--

wesm commented on issue #1248: ARROW-587: Add fix version to PR merge tool 
URL: https://github.com/apache/arrow/pull/1248#issuecomment-339528271
 
 
   Tried merging #1245 but there was a bug. Will keep at it until this script 
is right




> Add JIRA fix version to merge tool
> --
>
> Key: ARROW-587
> URL: https://issues.apache.org/jira/browse/ARROW-587
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Like parquet-mr's tool. This will make releases less painful





[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219844#comment-16219844
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

wesm commented on issue #1095: ARROW-1425 [Python] Document semantic 
differences between Spark and Arrow timestamps
URL: https://github.com/apache/arrow/pull/1095#issuecomment-339527887
 
 
   @icexelloss @heimir-sverrisson it may make sense to engage in 
https://github.com/apache/spark/pull/18664 and at least try to process the 
discussion that is going on around time zones. This is some very thorny stuff 
and I don't have the bandwidth right this moment to properly engage with this




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]
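
The ambiguity can be demonstrated with plain pandas (illustrative only; the timestamp and session zone are arbitrary examples):

```python
import pandas as pd

# A timezone-naive timestamp carries only wall-clock fields.
naive = pd.Timestamp('2017-10-25 12:00:00')

# Spark interprets it as session-local time; a consumer may read the same
# field values as if they were UTC. The two readings are different instants:
as_utc = naive.tz_localize('UTC')
as_session = naive.tz_localize('America/Los_Angeles')  # example session zone
assert as_session - as_utc == pd.Timedelta(hours=7)    # PDT is UTC-7
```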





[jira] [Updated] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2017-10-25 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1425:
--
Labels: pull-request-available  (was: )

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]





[jira] [Created] (ARROW-1733) [C++] Utility for allocating fixed-size mutable primitive ArrayData with a single memory allocation

2017-10-25 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1733:
---

 Summary: [C++] Utility for allocating fixed-size mutable primitive 
ArrayData with a single memory allocation
 Key: ARROW-1733
 URL: https://issues.apache.org/jira/browse/ARROW-1733
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


The validity bitmap and the values for the primitive data would be part of a 
single allocation, so better heap locality and possibly better performance in 
aggregate. This same approach is also being worked on for Java
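
As an illustration of the layout arithmetic (an assumption-laden sketch, not the eventual C++ utility): with Arrow's 64-byte buffer alignment, the bitmap region and the values region of a primitive array can be sized and placed inside one allocation like this:

```python
ALIGNMENT = 64  # Arrow pads buffers to 64-byte boundaries

def padded(nbytes, alignment=ALIGNMENT):
    """Round nbytes up to the next multiple of alignment."""
    return -(-nbytes // alignment) * alignment

def combined_layout(length, value_width):
    """Offsets of (bitmap, values) and total size for one shared allocation."""
    bitmap_nbytes = padded((length + 7) // 8)   # 1 validity bit per value
    values_nbytes = padded(length * value_width)
    return 0, bitmap_nbytes, bitmap_nbytes + values_nbytes

# 100 int64 values: 13 bitmap bytes pad to 64, 800 value bytes pad to 832.
assert combined_layout(100, 8) == (0, 64, 896)
```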





[jira] [Commented] (ARROW-1710) [Java] Decide what to do with non-nullable vectors in new vector class hierarchy

2017-10-25 Thread Jacques Nadeau (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219836#comment-16219836
 ] 

Jacques Nadeau commented on ARROW-1710:
---

I'm one of the voices strongly arguing for dropping the additional class 
objects. (I also was the one who originally introduced the two separate sets 
when the code was first developed.) My experience has been the following:

* Extra complexity of managing two different runtime classes is very expensive 
(maintenance, coercing between, managing runtime code generation, etc)
* Most source data is actually declared as nullable but rarely has nulls

As such, having an adaptive interaction where you inspect values 64 at a time 
and adapt your behavior based on actual nullability (as opposed to declared 
nullability) provides a much better performance lift in real-world use cases 
than having specialized code for declared non-nullable situations.

FYI: [~e.levine], the updated approach with vectors is moving to a situation 
where we don't have a bit vector and ultimately also consolidates the buffer 
for the bits and the fixed bytes in the same buffer. In that case, there is no 
heap memory overhead and the direct memory overhead is 1 bit per value, far 
less than necessary.

Also note that in reality, most people focused on super high performance Java 
implementations interact directly with the memory. You can see an example of 
how we do this here: 
https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java#L89

If, in the future, people need the vector classes to have an additional set 
of methods such as: 
allocateNewNoNull()
setSafeIgnoreNull(int index, int value) 

let's just add those when someone's use case requires it. No need to have an 
extra set of vectors for that purpose.
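
The "adapt per 64 values" idea can be sketched in a few lines (a hypothetical illustration, not the Java vector code): inspect one 64-bit validity word at a time and take a branch-free path whenever the word shows no nulls:

```python
def sum_valid(values, validity_words):
    """Sum values, consulting validity one 64-bit word at a time."""
    total = 0
    for w, word in enumerate(validity_words):
        base = w * 64
        n = min(64, len(values) - base)
        window = (1 << n) - 1
        if (word & window) == window:
            # All values in this word are valid: tight loop, no per-value branch.
            total += sum(values[base:base + n])
        elif word:
            # Mixed validity: fall back to per-bit checks.
            for i in range(n):
                if (word >> i) & 1:
                    total += values[base + i]
    return total

values = list(range(70))
words = [(1 << 64) - 1, 0b111011]  # second word: bit 2 clear, so index 66 is null
assert sum_valid(values, words) == sum(range(70)) - 66
```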


> [Java] Decide what to do with non-nullable vectors in new vector class 
> hierarchy 
> -
>
> Key: ARROW-1710
> URL: https://issues.apache.org/jira/browse/ARROW-1710
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java - Vectors
>Reporter: Li Jin
> Fix For: 0.8.0
>
>
> So far the consensus seems to be remove all non-nullable vectors. 





[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219709#comment-16219709
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

Licht-T commented on issue #1246: ARROW-1721: [Python] Implement null-mask 
check in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246#issuecomment-339501657
 
 
   @wesm Thank you!




> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375





[jira] [Assigned] (ARROW-1727) [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries

2017-10-25 Thread Brian Hulette (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette reassigned ARROW-1727:


Assignee: Brian Hulette

> [Format] Expand Arrow streaming format to permit new dictionaries and deltas 
> / additions to existing dictionaries
> -
>
> Key: ARROW-1727
> URL: https://issues.apache.org/jira/browse/ARROW-1727
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Brian Hulette
> Fix For: 0.8.0
>
>






[jira] [Assigned] (ARROW-1047) [Java] Add generalized stream writer and reader interfaces that are decoupled from IO / message framing

2017-10-25 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned ARROW-1047:
---

Assignee: Bryan Cutler

> [Java] Add generalized stream writer and reader interfaces that are decoupled 
> from IO / message framing
> ---
>
> Key: ARROW-1047
> URL: https://issues.apache.org/jira/browse/ARROW-1047
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
>
> cc [~julienledem] [~elahrvivaz] [~nongli]
> The ArrowWriter 
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java
>  accepts a WriteableByteChannel where the stream is written
> It would be useful to be able to support other kinds of message framing and 
> transport, like GRPC or HTTP. So rather than writing a complete Arrow stream 
> as a single contiguous byte stream, the component messages (schema, 
> dictionaries, and record batches) would be framed as separate messages in the 
> underlying protocol. 
> So if we were using ProtocolBuffers and gRPC as the underlying transport for 
> the stream, we could encapsulate components of an Arrow stream in objects 
> like:
> {code:language=protobuf}
> message ArrowMessagePB {
>   required bytes serialized_data;
> }
> {code}
> If the transport supports zero copy, that is obviously better than 
> serializing then parsing a protocol buffer.
> We should do this work in C++ as well to support more flexible stream 
> transport. 





[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219454#comment-16219454
 ] 

ASF GitHub Bot commented on ARROW-1723:
---

MaxRis commented on a change in pull request #1244: ARROW-1723: [C++] add 
ARROW_STATIC to mark static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#discussion_r146974055
 
 

 ##
 File path: cpp/cmake_modules/BuildUtils.cmake
 ##
 @@ -154,22 +161,28 @@ function(ADD_ARROW_LIB LIB_NAME)
   endif()
 
   if (ARROW_BUILD_STATIC)
-  if (MSVC)
-set(LIB_NAME_STATIC ${LIB_NAME}_static)
-  else()
-set(LIB_NAME_STATIC ${LIB_NAME})
-  endif()
-  add_library(${LIB_NAME}_static STATIC $<TARGET_OBJECTS:${LIB_NAME}_objlib>)
+if (MSVC)
+  set(LIB_NAME_STATIC ${LIB_NAME}_static)
+else()
+  set(LIB_NAME_STATIC ${LIB_NAME})
+endif()
+add_library(${LIB_NAME}_static STATIC ${LIB_DEPS})
+if(EXTRA_DEPS)
+  add_dependencies(${LIB_NAME}_static ${EXTRA_DEPS})
+endif()
+
 set_target_properties(${LIB_NAME}_static
   PROPERTIES
   LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}"
   OUTPUT_NAME ${LIB_NAME_STATIC})
 
-  target_link_libraries(${LIB_NAME}_static
+target_compile_definitions(${LIB_NAME}_static PUBLIC ARROW_STATIC)
 
 Review comment:
   @JohnPJenkins  To avoid confusion, it might make sense to define this only 
for MSVC




> Windows: __declspec(dllexport) specified when building arrow static library
> ---
>
> Key: ARROW-1723
> URL: https://issues.apache.org/jira/browse/ARROW-1723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: John Jenkins
>  Labels: pull-request-available
>
> As I understand it, dllexport/dllimport should be left out when building and 
> using static libraries on Windows. A PR will follow shortly.





[jira] [Commented] (ARROW-1660) [Python] pandas field values are messed up across rows

2017-10-25 Thread MIkhail Osckin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219432#comment-16219432
 ] 

MIkhail Osckin commented on ARROW-1660:
---

I definitely tested it with the latest pyarrow version at the moment. I had the 
same intuition that this issue might be related to slicing, because my initial 
dataset was ordered by the id field and the top of the dataset (after to_pandas) 
was something like 10012, 10015, 10034, and the row with id 10018 had values 
from 10034, and only part of them, in at least one column (and, if I remember 
well, 10018 was the exact third id in ascending order).

> [Python] pandas field values are messed up across rows
> --
>
> Key: ARROW-1660
> URL: https://issues.apache.org/jira/browse/ARROW-1660
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
> Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3
>Reporter: MIkhail Osckin
>Assignee: Wes McKinney
>
> I have the following scala case class to store sparse matrix data to read it 
> later using python
> {code:java}
> case class CooVector(
> id: Int,
> row_ids: Seq[Int],
> rowsIdx: Seq[Int],
> colIdx: Seq[Int],
> data: Seq[Double])
> {code}
> I save the dataset of this type to multiple parquet files using spark and 
> then read it using pyarrow.parquet and convert the result to pandas dataset.
> The problem i have is that some values end up in wrong rows, for example, 
> row_ids might end up in wrong cooVector row. I have no idea what the reason 
> is but might be it is related to the fact that the fields are of variable 
> sizes. And everything is correct if i read it using spark. Also i checked 
> to_pydict method and the result is correct, so seems like the problem 
> somewhere in to_pandas method.





[jira] [Updated] (ARROW-1728) [C++] Run clang-format checks in Travis CI

2017-10-25 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1728:
--
Labels: pull-request-available  (was: )

> [C++] Run clang-format checks in Travis CI
> --
>
> Key: ARROW-1728
> URL: https://issues.apache.org/jira/browse/ARROW-1728
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> I think it's reasonable to expect contributors to run clang-format on their 
> code. This may lead to a higher number of failed builds but will eliminate 
> noise diffs in unrelated patches





[jira] [Commented] (ARROW-1728) [C++] Run clang-format checks in Travis CI

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219426#comment-16219426
 ] 

ASF GitHub Bot commented on ARROW-1728:
---

wesm opened a new pull request #1251: ARROW-1728: [C++] Run clang-format checks 
in Travis CI
URL: https://github.com/apache/arrow/pull/1251
 
 
   I also deliberately checked in a single flake so I can confirm this is 
working properly




> [C++] Run clang-format checks in Travis CI
> --
>
> Key: ARROW-1728
> URL: https://issues.apache.org/jira/browse/ARROW-1728
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> I think it's reasonable to expect contributors to run clang-format on their 
> code. This may lead to a higher number of failed builds but will eliminate 
> noise diffs in unrelated patches





[jira] [Assigned] (ARROW-1728) [C++] Run clang-format checks in Travis CI

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1728:
---

Assignee: Wes McKinney

> [C++] Run clang-format checks in Travis CI
> --
>
> Key: ARROW-1728
> URL: https://issues.apache.org/jira/browse/ARROW-1728
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> I think it's reasonable to expect contributors to run clang-format on their 
> code. This may lead to a higher number of failed builds but will eliminate 
> noise diffs in unrelated patches





[jira] [Commented] (ARROW-1491) [C++] Add casting implementations from strings to numbers or boolean

2017-10-25 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219398#comment-16219398
 ] 

Wes McKinney commented on ARROW-1491:
-

While this would be nice, it's not immediately urgent. Some help would be 
appreciated

> [C++] Add casting implementations from strings to numbers or boolean
> 
>
> Key: ARROW-1491
> URL: https://issues.apache.org/jira/browse/ARROW-1491
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>






[jira] [Updated] (ARROW-1491) [C++] Add casting implementations from strings to numbers or boolean

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1491:

Fix Version/s: (was: 0.8.0)
   0.9.0

> [C++] Add casting implementations from strings to numbers or boolean
> 
>
> Key: ARROW-1491
> URL: https://issues.apache.org/jira/browse/ARROW-1491
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219395#comment-16219395
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check 
in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246#issuecomment-339448319
 
 
   Turns out I can push to your branch, so done




> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375





[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219390#comment-16219390
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check 
in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246#issuecomment-339447985
 
 
   A couple of flake8 warnings:
   
   ```
   +flake8 --count /home/travis/build/apache/arrow/python/pyarrow
   /home/travis/build/apache/arrow/python/pyarrow/tests/test_convert_pandas.py:22:1: F401 'unittest' imported but unused
   /home/travis/build/apache/arrow/python/pyarrow/tests/test_convert_pandas.py:1115:16: E231 missing whitespace after ','
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write

2017-10-25 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1675:
--
Labels: pull-request-available  (was: )

> [Python] Use RecordBatch.from_pandas in FeatherWriter.write
> ---
>
> Key: ARROW-1675
> URL: https://issues.apache.org/jira/browse/ARROW-1675
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> In addition to making the implementation simpler, we will also benefit from 
> multithreaded conversions, and thus faster write speeds



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219316#comment-16219316
 ] 

ASF GitHub Bot commented on ARROW-1675:
---

wesm opened a new pull request #1250: ARROW-1675: [Python] Use 
RecordBatch.from_pandas in Feather write path
URL: https://github.com/apache/arrow/pull/1250
 
 
   This also makes Feather writes more robust to columns having a mix of 
unicode and bytes (these get coerced to binary)
   
   Also resolves ARROW-1672


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Use RecordBatch.from_pandas in FeatherWriter.write
> ---
>
> Key: ARROW-1675
> URL: https://issues.apache.org/jira/browse/ARROW-1675
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> In addition to making the implementation simpler, we will also benefit from 
> multithreaded conversions, and thus faster write speeds



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1672) [Python] Failure to write Feather bytes column

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1672:
---

Assignee: Wes McKinney

> [Python] Failure to write Feather bytes column
> --
>
> Key: ARROW-1672
> URL: https://issues.apache.org/jira/browse/ARROW-1672
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> See bug report in https://github.com/wesm/feather/issues/320



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1732) [Python] RecordBatch.from_pandas fails on DataFrame with no columns when preserve_index=False

2017-10-25 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1732:
---

 Summary: [Python] RecordBatch.from_pandas fails on DataFrame with 
no columns when preserve_index=False
 Key: ARROW-1732
 URL: https://issues.apache.org/jira/browse/ARROW-1732
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.8.0


I believe this should have well-defined behavior and not raise an error:

{code}
In [5]: pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5> in <module>()
> 1 pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_pandas 
(/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:39957)()
586 df, schema, preserve_index, nthreads=nthreads
587 )
--> 588 return cls.from_arrays(arrays, names, metadata)
589 
590 @staticmethod

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays 
(/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:40130)()
615 
616 if not number_of_arrays:
--> 617 raise ValueError('Record batch cannot contain no arrays 
(for now)')
618 
619 num_rows = len(arrays[0])

ValueError: Record batch cannot contain no arrays (for now)
{code}
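The "well-defined behavior" the report asks for can be sketched with a toy model (this is an illustrative class, not the pyarrow implementation): a record batch with zero columns is still meaningful, with a field-less schema and zero rows, so construction should not raise.

```python
class MiniRecordBatch:
    """Toy model of a record batch; not the pyarrow class."""

    def __init__(self, arrays, names):
        if len(arrays) != len(names):
            raise ValueError("need one name per array")
        self.arrays = arrays
        self.names = names
        # With no arrays there are simply no rows, rather than an error.
        self.num_rows = len(arrays[0]) if arrays else 0

# MiniRecordBatch([], []).num_rows -> 0
```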



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1718) [Python] Creating a pyarrow.Array of date type from pandas causes error

2017-10-25 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-1718:

Description: 
When calling {{Array.from_pandas}} with a pandas.Series of dates and specifying 
the desired pyarrow type, an error occurs.  If the type is not specified then 
{{from_pandas}} will interpret the data as a timestamp type.

{code}
import pandas as pd
import pyarrow as pa
import datetime

arr = pa.array([datetime.date(2017, 10, 23)])
c = pa.Column.from_array("d", arr)

s = c.to_pandas()
print(s)
# 0   2017-10-23
# Name: d, dtype: datetime64[ns]

result = pa.Array.from_pandas(s, type=pa.date32())
print(result)
"""
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ 
(/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221)
  File 
"/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py",
 line 28, in array_format
values.append(value_format(x, 0))
  File 
"/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py",
 line 49, in value_format
return repr(x)
  File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ 
(/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535)
  File "pyarrow/scalar.pxi", line 137, in pyarrow.lib.Date32Value.as_py 
(/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:20368)
ValueError: year is out of range
"""
{code}

  was:
When calling {Array.from_pandas} with a pandas.Series of dates and specifying 
the desired pyarrow type, an error occurs.  If the type is not specified then 
{from_pandas} will interpret the data as a timestamp type.

{code}
import pandas as pd
import pyarrow as pa
import datetime

arr = pa.array([datetime.date(2017, 10, 23)])
c = pa.Column.from_array("d", arr)

s = c.to_pandas()
print(s)
# 0   2017-10-23
# Name: d, dtype: datetime64[ns]

result = pa.Array.from_pandas(s, type=pa.date32())
print(result)
"""
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ 
(/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221)
  File 
"/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py",
 line 28, in array_format
values.append(value_format(x, 0))
  File 
"/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py",
 line 49, in value_format
return repr(x)
  File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ 
(/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535)
  File "pyarrow/scalar.pxi", line 137, in pyarrow.lib.Date32Value.as_py 
(/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:20368)
ValueError: year is out of range
"""
{code}


> [Python] Creating a pyarrow.Array of date type from pandas causes error
> ---
>
> Key: ARROW-1718
> URL: https://issues.apache.org/jira/browse/ARROW-1718
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Bryan Cutler
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> When calling {{Array.from_pandas}} with a pandas.Series of dates and 
> specifying the desired pyarrow type, an error occurs.  If the type is not 
> specified then {{from_pandas}} will interpret the data as a timestamp type.
> {code}
> import pandas as pd
> import pyarrow as pa
> import datetime
> arr = pa.array([datetime.date(2017, 10, 23)])
> c = pa.Column.from_array("d", arr)
> s = c.to_pandas()
> print(s)
> # 0   2017-10-23
> # Name: d, dtype: datetime64[ns]
> result = pa.Array.from_pandas(s, type=pa.date32())
> print(result)
> """
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ 
> (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221)
>   File 
> "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py",
>  line 28, in array_format
> values.append(value_format(x, 0))
>   File 
> "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py",
>  line 49, in value_format
> return repr(x)
>   File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ 
> (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535)
>   File "pyarrow/scalar.pxi", line 137, in pyarrow.lib.Date32Value.as_py 
> (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:20368)
> ValueError: year is out of range
> """
> {code}
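The "year is out of range" failure above is consistent with datetime64[ns] values being handed to a date32 column without converting nanoseconds to days, so a huge nanosecond count gets read as a day count. A sketch of the conversion a correct path needs (`ns_to_date32_days` is a hypothetical helper name, not a pyarrow function):

```python
NS_PER_DAY = 86400 * 10**9  # date32 stores whole days since 1970-01-01

def ns_to_date32_days(ns_since_epoch):
    # Hypothetical helper: nanoseconds since epoch -> date32 day count.
    return ns_since_epoch // NS_PER_DAY
```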

[jira] [Created] (ARROW-1731) [Python] Provide for selecting a subset of columns to convert in RecordBatch/Table.from_pandas

2017-10-25 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1731:
---

 Summary: [Python] Provide for selecting a subset of columns to 
convert in RecordBatch/Table.from_pandas
 Key: ARROW-1731
 URL: https://issues.apache.org/jira/browse/ARROW-1731
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


Currently it's all-or-nothing, and to do the subsetting in pandas incurs a data 
copy. This would enable columns (by name or index) to be selected out without 
additional data copying

cc [~cpcloud] [~jreback]
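A hypothetical sketch of the requested selection semantics (the argument names here are assumptions, not an existing API): accept column names or integer positions and resolve them against the frame's column order up front, so no intermediate pandas copy is needed.

```python
def resolve_columns(all_names, selection):
    # Resolve a mixed name/index selection against the column order.
    resolved = []
    for item in selection:
        if isinstance(item, int):
            resolved.append(all_names[item])
        elif item in all_names:
            resolved.append(item)
        else:
            raise KeyError(item)
    return resolved

# resolve_columns(["a", "b", "c"], [0, "c"]) -> ["a", "c"]
```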



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1675:
---

Assignee: Wes McKinney

> [Python] Use RecordBatch.from_pandas in FeatherWriter.write
> ---
>
> Key: ARROW-1675
> URL: https://issues.apache.org/jira/browse/ARROW-1675
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> In addition to making the implementation simpler, we will also benefit from 
> multithreaded conversions, so faster write speeds



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1718) [Python] Creating a pyarrow.Array of date type from pandas causes error

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1718:
---

Assignee: Wes McKinney

> [Python] Creating a pyarrow.Array of date type from pandas causes error
> ---
>
> Key: ARROW-1718
> URL: https://issues.apache.org/jira/browse/ARROW-1718
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Bryan Cutler
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> When calling {Array.from_pandas} with a pandas.Series of dates and specifying 
> the desired pyarrow type, an error occurs.  If the type is not specified then 
> {from_pandas} will interpret the data as a timestamp type.
> {code}
> import pandas as pd
> import pyarrow as pa
> import datetime
> arr = pa.array([datetime.date(2017, 10, 23)])
> c = pa.Column.from_array("d", arr)
> s = c.to_pandas()
> print(s)
> # 0   2017-10-23
> # Name: d, dtype: datetime64[ns]
> result = pa.Array.from_pandas(s, type=pa.date32())
> print(result)
> """
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ 
> (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221)
>   File 
> "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py",
>  line 28, in array_format
> values.append(value_format(x, 0))
>   File 
> "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py",
>  line 49, in value_format
> return repr(x)
>   File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ 
> (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535)
>   File "pyarrow/scalar.pxi", line 137, in pyarrow.lib.Date32Value.as_py 
> (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:20368)
> ValueError: year is out of range
> """
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1730) [Python] Incorrect result from pyarrow.array when passing timestamp type

2017-10-25 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219186#comment-16219186
 ] 

Wes McKinney commented on ARROW-1730:
-

But

{code}
In [15]: pa.array(np.array([0], dtype='int64'), type=pa.timestamp('ns'))
Out[15]: 

[
  Timestamp('1970-01-01 00:00:00')
]
{code}

> [Python] Incorrect result from pyarrow.array when passing timestamp type
> 
>
> Key: ARROW-1730
> URL: https://issues.apache.org/jira/browse/ARROW-1730
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>
> Even with the ARROW-1484 patch, we have:
> {code:python}
> In [10]: pa.array([0], type=pa.timestamp('ns'))
> Out[10]: 
> 
> [
>   Timestamp('1968-01-12 11:18:14.409378304')
> ]
> In [11]: pa.array([0], type='int64').cast(pa.timestamp('ns'))
> Out[11]: 
> 
> [
>   Timestamp('1970-01-01 00:00:00')
> ]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1730) [Python] Incorrect result from pyarrow.array when passing timestamp type

2017-10-25 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1730:
---

 Summary: [Python] Incorrect result from pyarrow.array when passing 
timestamp type
 Key: ARROW-1730
 URL: https://issues.apache.org/jira/browse/ARROW-1730
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.8.0


Even with the ARROW-1484 patch, we have:

{code:python}
In [10]: pa.array([0], type=pa.timestamp('ns'))
Out[10]: 

[
  Timestamp('1968-01-12 11:18:14.409378304')
]

In [11]: pa.array([0], type='int64').cast(pa.timestamp('ns'))
Out[11]: 

[
  Timestamp('1970-01-01 00:00:00')
]
{code}
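The contrast above comes down to epoch arithmetic: integer 0 with type timestamp('ns') means "0 nanoseconds since the Unix epoch", which must map to 1970-01-01 00:00:00, as the int64-then-cast path correctly shows. A pure-Python sketch of that interpretation (illustrative only, not the pyarrow conversion code):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def ns_to_datetime(ns):
    # datetime only resolves microseconds, so drop the sub-microsecond part.
    return EPOCH + timedelta(microseconds=ns // 1000)

# ns_to_datetime(0) -> 1970-01-01 00:00:00+00:00
```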



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1660) [Python] pandas field values are messed up across rows

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1660:

Fix Version/s: (was: 0.8.0)

> [Python] pandas field values are messed up across rows
> --
>
> Key: ARROW-1660
> URL: https://issues.apache.org/jira/browse/ARROW-1660
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
> Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3
>Reporter: MIkhail Osckin
>Assignee: Wes McKinney
>
> I have the following scala case class to store sparse matrix data to read it 
> later using python
> {code:java}
> case class CooVector(
> id: Int,
> row_ids: Seq[Int],
> rowsIdx: Seq[Int],
> colIdx: Seq[Int],
> data: Seq[Double])
> {code}
> I save the dataset of this type to multiple parquet files using spark and 
> then read it using pyarrow.parquet and convert the result to pandas dataset.
> The problem I have is that some values end up in the wrong rows; for example, 
> row_ids might end up in the wrong CooVector row. I have no idea what the 
> reason is, but it may be related to the fact that the fields are of variable 
> sizes. Everything is correct if I read the data using Spark. I also checked 
> the to_pydict method and its result is correct, so the problem seems to be 
> somewhere in the to_pandas method.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1660) [Python] pandas field values are messed up across rows

2017-10-25 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219171#comment-16219171
 ] 

Wes McKinney commented on ARROW-1660:
-

Is it possible you were using pyarrow < 0.7.0? There was a bug, ARROW-1357, 
since fixed, that would cause the issue you were seeing. I'm a bit at a loss, 
since the relevant test case is 
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_convert_pandas.py#L600.
I will move this off the 0.8.0 milestone, but leave the issue open in case you 
can find a repro.

> [Python] pandas field values are messed up across rows
> --
>
> Key: ARROW-1660
> URL: https://issues.apache.org/jira/browse/ARROW-1660
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
> Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3
>Reporter: MIkhail Osckin
>Assignee: Wes McKinney
>
> I have the following scala case class to store sparse matrix data to read it 
> later using python
> {code:java}
> case class CooVector(
> id: Int,
> row_ids: Seq[Int],
> rowsIdx: Seq[Int],
> colIdx: Seq[Int],
> data: Seq[Double])
> {code}
> I save the dataset of this type to multiple parquet files using spark and 
> then read it using pyarrow.parquet and convert the result to pandas dataset.
> The problem I have is that some values end up in the wrong rows; for example, 
> row_ids might end up in the wrong CooVector row. I have no idea what the 
> reason is, but it may be related to the fact that the fields are of variable 
> sizes. Everything is correct if I read the data using Spark. I also checked 
> the to_pydict method and its result is correct, so the problem seems to be 
> somewhere in the to_pandas method.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1455) [Python] Add Dockerfile for validating Dask integration outside of usual CI

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219127#comment-16219127
 ] 

ASF GitHub Bot commented on ARROW-1455:
---

wesm commented on issue #1249: ARROW-1455 [Python] Add Dockerfile for 
validating Dask integration
URL: https://github.com/apache/arrow/pull/1249#issuecomment-339405716
 
 
   We should not check data files into the git repo, so we will need to handle 
test data in some other way. We will also want to collect the Python-related 
integration tests someplace Python-specific. I will review in more detail when 
I can.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Add Dockerfile for validating Dask integration outside of usual CI
> ---
>
> Key: ARROW-1455
> URL: https://issues.apache.org/jira/browse/ARROW-1455
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>  Labels: pull-request-available
>
> Introducing the Dask stack into Arrow's CI might be a bit heavyweight at the 
> moment, but we can add a testing set up in 
> https://github.com/apache/arrow/tree/master/python/testing so that this can 
> be validated on an ad hoc basis in a reproducible way.
> see also ARROW-1417



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Updated] (ARROW-1455) [Python] Add Dockerfile for validating Dask integration outside of usual CI

2017-10-25 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1455:
--
Labels: pull-request-available  (was: )

> [Python] Add Dockerfile for validating Dask integration outside of usual CI
> ---
>
> Key: ARROW-1455
> URL: https://issues.apache.org/jira/browse/ARROW-1455
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>  Labels: pull-request-available
>
> Introducing the Dask stack into Arrow's CI might be a bit heavyweight at the 
> moment, but we can add a testing set up in 
> https://github.com/apache/arrow/tree/master/python/testing so that this can 
> be validated on an ad hoc basis in a reproducible way.
> see also ARROW-1417



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1455) [Python] Add Dockerfile for validating Dask integration outside of usual CI

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219122#comment-16219122
 ] 

ASF GitHub Bot commented on ARROW-1455:
---

heimir-sverrisson opened a new pull request #1249: ARROW-1455 [Python] Add 
Dockerfile for validating Dask integration
URL: https://github.com/apache/arrow/pull/1249
 
 
   A Docker container is created with all the dependencies needed to pull down 
the Dask code from Github and install it locally, together with Arrow, to run 
an integration test.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Add Dockerfile for validating Dask integration outside of usual CI
> ---
>
> Key: ARROW-1455
> URL: https://issues.apache.org/jira/browse/ARROW-1455
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>  Labels: pull-request-available
>
> Introducing the Dask stack into Arrow's CI might be a bit heavyweight at the 
> moment, but we can add a testing set up in 
> https://github.com/apache/arrow/tree/master/python/testing so that this can 
> be validated on an ad hoc basis in a reproducible way.
> see also ARROW-1417



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1484) [C++] Implement (safe and unsafe) casts between timestamps and times of different units

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219123#comment-16219123
 ] 

ASF GitHub Bot commented on ARROW-1484:
---

wesm commented on a change in pull request #1245: ARROW-1484: [C++/Python] 
Implement casts between date, time, timestamp units
URL: https://github.com/apache/arrow/pull/1245#discussion_r146927485
 
 

 ##
 File path: cpp/src/arrow/compute/compute-test.cc
 ##
 @@ -270,6 +275,205 @@ TEST_F(TestCast, ToIntDowncastUnsafe) {
 options);
 }
 
+TEST_F(TestCast, TimestampToTimestamp) {
+  CastOptions options;
+
+  auto CheckTimestampCast = [this](
+      const CastOptions& options, TimeUnit::type from_unit, TimeUnit::type to_unit,
+      const std::vector<int64_t>& from_values, const std::vector<int64_t>& to_values,
+      const std::vector<bool>& is_valid) {
+    CheckCase<TimestampType, int64_t, TimestampType, int64_t>(
+        timestamp(from_unit), from_values, is_valid, timestamp(to_unit), to_values,
+        options);
+  };
+
+  vector<bool> is_valid = {true, false, true, true, true};
+
+  // Multiply promotions
+  vector<int64_t> v1 = {0, 100, 200, 1, 2};
+  vector<int64_t> e1 = {0, 100000, 200000, 1000, 2000};
+  CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::MILLI, v1, e1, is_valid);
+
+  vector<int64_t> v2 = {0, 100, 200, 1, 2};
+  vector<int64_t> e2 = {0, 100000000L, 200000000L, 1000000, 2000000};
+  CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::MICRO, v2, e2, is_valid);
+
+  vector<int64_t> v3 = {0, 100, 200, 1, 2};
+  vector<int64_t> e3 = {0, 100000000000L, 200000000000L, 1000000000L, 2000000000L};
+  CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::NANO, v3, e3, is_valid);
+
+  vector<int64_t> v4 = {0, 100, 200, 1, 2};
+  vector<int64_t> e4 = {0, 100000, 200000, 1000, 2000};
+  CheckTimestampCast(options, TimeUnit::MILLI, TimeUnit::MICRO, v4, e4, is_valid);
+
+  vector<int64_t> v5 = {0, 100, 200, 1, 2};
+  vector<int64_t> e5 = {0, 100000000L, 200000000L, 1000000, 2000000};
+  CheckTimestampCast(options, TimeUnit::MILLI, TimeUnit::NANO, v5, e5, is_valid);
+
+  vector<int64_t> v6 = {0, 100, 200, 1, 2};
+  vector<int64_t> e6 = {0, 100000, 200000, 1000, 2000};
+  CheckTimestampCast(options, TimeUnit::MICRO, TimeUnit::NANO, v6, e6, is_valid);
+
+  // Zero copy
+  std::shared_ptr<Array> arr;
+  vector<int64_t> v7 = {0, 7, 2000, 1000, 0};
+  ArrayFromVector<TimestampType, int64_t>(timestamp(TimeUnit::SECOND), is_valid, v7,
+      &arr);
+  CheckZeroCopy(*arr, timestamp(TimeUnit::SECOND));
+
+  // Divide, truncate
+  vector<int64_t> v8 = {0, 100123, 200456, 1123, 2456};
+  vector<int64_t> e8 = {0, 100, 200, 1, 2};
+
+  options.allow_time_truncate = true;
 
 Review comment:
   Thanks for catching. I'll make `safe=True` set this option 
http://arrow.apache.org/docs/python/generated/pyarrow.lib.Array.html#pyarrow.lib.Array.cast


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
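The "divide, truncate" case under review can be sketched in a few lines of illustrative Python (this mirrors the idea behind `allow_time_truncate`, not the C++ kernel itself): casting to a coarser unit divides by the unit ratio, and a safe cast must reject values that would lose precision unless truncation is explicitly allowed.

```python
def downcast_timestamp(values, ratio, allow_truncate=False):
    # ratio: e.g. 1000 for micro -> milli, 10**9 for nano -> second.
    out = []
    for v in values:
        if v % ratio != 0 and not allow_truncate:
            raise ValueError("casting %d to a coarser unit would lose precision" % v)
        out.append(v // ratio)
    return out

# downcast_timestamp([0, 100123, 200456], 1000, allow_truncate=True) -> [0, 100, 200]
```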


> [C++] Implement (safe and unsafe) casts between timestamps and times of 
> different units
> ---
>
> Key: ARROW-1484
> URL: https://issues.apache.org/jira/browse/ARROW-1484
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1484) [C++] Implement (safe and unsafe) casts between timestamps and times of different units

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219116#comment-16219116
 ] 

ASF GitHub Bot commented on ARROW-1484:
---

BryanCutler commented on a change in pull request #1245: ARROW-1484: 
[C++/Python] Implement casts between date, time, timestamp units
URL: https://github.com/apache/arrow/pull/1245#discussion_r146926826
 
 

 ##
 File path: cpp/src/arrow/compute/compute-test.cc
 ##
 @@ -270,6 +275,205 @@ TEST_F(TestCast, ToIntDowncastUnsafe) {
 options);
 }
 
+TEST_F(TestCast, TimestampToTimestamp) {
+  CastOptions options;
+
+  auto CheckTimestampCast = [this](
+      const CastOptions& options, TimeUnit::type from_unit, TimeUnit::type to_unit,
+      const std::vector<int64_t>& from_values, const std::vector<int64_t>& to_values,
+      const std::vector<bool>& is_valid) {
+    CheckCase<TimestampType, int64_t, TimestampType, int64_t>(
+        timestamp(from_unit), from_values, is_valid, timestamp(to_unit), to_values,
+        options);
+  };
+
+  vector<bool> is_valid = {true, false, true, true, true};
+
+  // Multiply promotions
+  vector<int64_t> v1 = {0, 100, 200, 1, 2};
+  vector<int64_t> e1 = {0, 100000, 200000, 1000, 2000};
+  CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::MILLI, v1, e1, is_valid);
+
+  vector<int64_t> v2 = {0, 100, 200, 1, 2};
+  vector<int64_t> e2 = {0, 100000000L, 200000000L, 1000000, 2000000};
+  CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::MICRO, v2, e2, is_valid);
+
+  vector<int64_t> v3 = {0, 100, 200, 1, 2};
+  vector<int64_t> e3 = {0, 100000000000L, 200000000000L, 1000000000L, 2000000000L};
+  CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::NANO, v3, e3, is_valid);
+
+  vector<int64_t> v4 = {0, 100, 200, 1, 2};
+  vector<int64_t> e4 = {0, 100000, 200000, 1000, 2000};
+  CheckTimestampCast(options, TimeUnit::MILLI, TimeUnit::MICRO, v4, e4, is_valid);
+
+  vector<int64_t> v5 = {0, 100, 200, 1, 2};
+  vector<int64_t> e5 = {0, 100000000L, 200000000L, 1000000, 2000000};
+  CheckTimestampCast(options, TimeUnit::MILLI, TimeUnit::NANO, v5, e5, is_valid);
+
+  vector<int64_t> v6 = {0, 100, 200, 1, 2};
+  vector<int64_t> e6 = {0, 100000, 200000, 1000, 2000};
+  CheckTimestampCast(options, TimeUnit::MICRO, TimeUnit::NANO, v6, e6, is_valid);
+
+  // Zero copy
+  std::shared_ptr<Array> arr;
+  vector<int64_t> v7 = {0, 7, 2000, 1000, 0};
+  ArrayFromVector<TimestampType, int64_t>(timestamp(TimeUnit::SECOND), is_valid, v7,
+      &arr);
+  CheckZeroCopy(*arr, timestamp(TimeUnit::SECOND));
+
+  // Divide, truncate
+  vector<int64_t> v8 = {0, 100123, 200456, 1123, 2456};
+  vector<int64_t> e8 = {0, 100, 200, 1, 2};
+
+  options.allow_time_truncate = true;
 
 Review comment:
   Does this option need to be set in pyarrow to prevent an error when 
truncating?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement (safe and unsafe) casts between timestamps and times of 
> different units
> ---
>
> Key: ARROW-1484
> URL: https://issues.apache.org/jira/browse/ARROW-1484
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1729) [C++] Upgrade clang bits to 5.0 once promoted to stable

2017-10-25 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1729:
---

 Summary: [C++] Upgrade clang bits to 5.0 once promoted to stable
 Key: ARROW-1729
 URL: https://issues.apache.org/jira/browse/ARROW-1729
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


This includes our CI toolchain and pinned clang-format version. According to 
http://apt.llvm.org/, 5.0 is still the "qualification branch" while 4.0 is stable



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219091#comment-16219091
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

Licht-T commented on issue #1246: ARROW-1721: [Python] Implement null-mask 
check in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246#issuecomment-339400591
 
 
   @wesm Now fixed!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219090#comment-16219090
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check 
in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246#issuecomment-339400213
 
 
   Thanks! According to llvm.org, clang-5.0 is still the qualification branch 
(http://apt.llvm.org/) so whenever 5.0 is promoted to stable we'll upgrade our 
clang bits to 5.0


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1727) [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1727:

Fix Version/s: 0.8.0

> [Format] Expand Arrow streaming format to permit new dictionaries and deltas 
> / additions to existing dictionaries
> -
>
> Key: ARROW-1727
> URL: https://issues.apache.org/jira/browse/ARROW-1727
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1727) [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries

2017-10-25 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219088#comment-16219088
 ] 

Wes McKinney commented on ARROW-1727:
-

Yes, documentation and adding to the Flatbuffers schemas. Flatbuffers supports 
default values, so we could make the default NEW.

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L132
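
Concretely, a backwards-compatible addition could look like the following Flatbuffers sketch (the field name `isDelta` and its placement are illustrative assumptions, not the committed schema):

```
table DictionaryBatch {
  id: long;
  data: RecordBatch;

  /// Default false: this batch defines a new dictionary for `id`.
  /// True: the batch's values are appended (a delta) to the existing
  /// dictionary. Because Flatbuffers fills in the default for absent
  /// fields, old writers' messages read as NEW without any change.
  isDelta: bool = false;
}
```

Relying on the field default is what makes the change wire-compatible with existing streams.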

> [Format] Expand Arrow streaming format to permit new dictionaries and deltas 
> / additions to existing dictionaries
> -
>
> Key: ARROW-1727
> URL: https://issues.apache.org/jira/browse/ARROW-1727
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219085#comment-16219085
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

Licht-T commented on issue #1246: ARROW-1721: [Python] Implement null-mask 
check in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246#issuecomment-339399427
 
 
   @wesm Sorry! I'm using clang-format-5!
   I'll fix!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219080#comment-16219080
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check 
in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246#issuecomment-339398833
 
 
   I ran clang-format 4.0 locally and got this diff 
https://github.com/wesm/arrow/commit/7547ac8e70b5279e44fe802bdbd241ad9a8f0d4a


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1660) pandas field values are messed up across rows

2017-10-25 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219063#comment-16219063
 ] 

Wes McKinney commented on ARROW-1660:
-

I think it might be related to splicing together files. I'll write some tests 
and then close this issue; if you are able to reproduce in the future please 
let us know

> pandas field values are messed up across rows
> -
>
> Key: ARROW-1660
> URL: https://issues.apache.org/jira/browse/ARROW-1660
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
> Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3
>Reporter: MIkhail Osckin
> Fix For: 0.8.0
>
>
> I have the following scala case class to store sparse matrix data to read it 
> later using python
> {code:java}
> case class CooVector(
> id: Int,
> row_ids: Seq[Int],
> rowsIdx: Seq[Int],
> colIdx: Seq[Int],
> data: Seq[Double])
> {code}
> I save a dataset of this type to multiple parquet files using Spark and 
> then read it using pyarrow.parquet, converting the result to a pandas DataFrame.
> The problem I have is that some values end up in the wrong rows; for example, 
> row_ids might end up in the wrong cooVector row. I have no idea what the reason 
> is, but it might be related to the fact that the fields are of variable sizes. 
> Everything is correct if I read the data using Spark. I also checked the 
> to_pydict method and its result is correct, so the problem seems to be 
> somewhere in the to_pandas method.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1660) [Python] pandas field values are messed up across rows

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1660:
---

Assignee: Wes McKinney

> [Python] pandas field values are messed up across rows
> --
>
> Key: ARROW-1660
> URL: https://issues.apache.org/jira/browse/ARROW-1660
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
> Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3
>Reporter: MIkhail Osckin
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> I have the following scala case class to store sparse matrix data to read it 
> later using python
> {code:java}
> case class CooVector(
> id: Int,
> row_ids: Seq[Int],
> rowsIdx: Seq[Int],
> colIdx: Seq[Int],
> data: Seq[Double])
> {code}
> I save a dataset of this type to multiple parquet files using Spark and 
> then read it using pyarrow.parquet, converting the result to a pandas DataFrame.
> The problem I have is that some values end up in the wrong rows; for example, 
> row_ids might end up in the wrong cooVector row. I have no idea what the reason 
> is, but it might be related to the fact that the fields are of variable sizes. 
> Everything is correct if I read the data using Spark. I also checked the 
> to_pydict method and its result is correct, so the problem seems to be 
> somewhere in the to_pandas method.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1660) [Python] pandas field values are messed up across rows

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1660:

Summary: [Python] pandas field values are messed up across rows  (was: 
pandas field values are messed up across rows)

> [Python] pandas field values are messed up across rows
> --
>
> Key: ARROW-1660
> URL: https://issues.apache.org/jira/browse/ARROW-1660
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
> Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3
>Reporter: MIkhail Osckin
> Fix For: 0.8.0
>
>
> I have the following scala case class to store sparse matrix data to read it 
> later using python
> {code:java}
> case class CooVector(
> id: Int,
> row_ids: Seq[Int],
> rowsIdx: Seq[Int],
> colIdx: Seq[Int],
> data: Seq[Double])
> {code}
> I save a dataset of this type to multiple parquet files using Spark and 
> then read it using pyarrow.parquet, converting the result to a pandas DataFrame.
> The problem I have is that some values end up in the wrong rows; for example, 
> row_ids might end up in the wrong cooVector row. I have no idea what the reason 
> is, but it might be related to the fact that the fields are of variable sizes. 
> Everything is correct if I read the data using Spark. I also checked the 
> to_pydict method and its result is correct, so the problem seems to be 
> somewhere in the to_pandas method.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (ARROW-1367) [Website] Divide CHANGELOG issues by component and add subheaders

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-1367.
---
Resolution: Won't Fix

Since some issues may be in multiple components, and some not at all, this is a 
bit complex to generate, for unclear benefit. Users can always browse the fix 
versions by component on JIRA

> [Website] Divide CHANGELOG issues by component and add subheaders
> -
>
> Key: ARROW-1367
> URL: https://issues.apache.org/jira/browse/ARROW-1367
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> This will make the changelog on the website more readable. JIRAs may appear 
> in more than one component listing. We should practice good JIRA hygiene by 
> associating all JIRAs with at least one component. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219051#comment-16219051
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check 
in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246#issuecomment-339395625
 
 
   I'm surprised by some of the formatting changes, are you using 
clang-format-4.0? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1727) [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries

2017-10-25 Thread Brian Hulette (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219039#comment-16219039
 ] 

Brian Hulette commented on ARROW-1727:
--

Is the scope of this ticket just making the appropriate documentation and/or 
flatbuffer spec changes in 
[/format|https://github.com/apache/arrow/tree/master/format]?

I like the idea of including a {{NEW/DELTA}} flag in the dictionary batch. 
Is there a way the flag could be optional and default to {{NEW}} for backwards 
compatibility? Or is that not worth the trouble?

> [Format] Expand Arrow streaming format to permit new dictionaries and deltas 
> / additions to existing dictionaries
> -
>
> Key: ARROW-1727
> URL: https://issues.apache.org/jira/browse/ARROW-1727
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219033#comment-16219033
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

Licht-T commented on issue #1246: ARROW-1721: [Python] Implement null-mask 
check in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246#issuecomment-339393634
 
 
   @wesm Fixed the whole lint issues!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-587) Add JIRA fix version to merge tool

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219021#comment-16219021
 ] 

ASF GitHub Bot commented on ARROW-587:
--

wesm opened a new pull request #1248: ARROW-587: Add fix version to PR merge 
tool 
URL: https://github.com/apache/arrow/pull/1248
 
 
   This was ported from parquet-mr/parquet-cpp. We should merge a separate 
patch with this branch before committing this


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add JIRA fix version to merge tool
> --
>
> Key: ARROW-587
> URL: https://issues.apache.org/jira/browse/ARROW-587
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Like parquet-mr's tool. This will make releases less painful



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-587) Add JIRA fix version to merge tool

2017-10-25 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-587:
-
Labels: pull-request-available  (was: )

> Add JIRA fix version to merge tool
> --
>
> Key: ARROW-587
> URL: https://issues.apache.org/jira/browse/ARROW-587
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Like parquet-mr's tool. This will make releases less painful



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1367) [Website] Divide CHANGELOG issues by component and add subheaders

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1367:
---

Assignee: Wes McKinney

> [Website] Divide CHANGELOG issues by component and add subheaders
> -
>
> Key: ARROW-1367
> URL: https://issues.apache.org/jira/browse/ARROW-1367
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> This will make the changelog on the website more readable. JIRAs may appear 
> in more than one component listing. We should practice good JIRA hygiene by 
> associating all JIRAs with at least one component. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-587) Add JIRA fix version to merge tool

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-587:
--

Assignee: Wes McKinney

> Add JIRA fix version to merge tool
> --
>
> Key: ARROW-587
> URL: https://issues.apache.org/jira/browse/ARROW-587
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> Like parquet-mr's tool. This will make releases less painful



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1721:
---

Assignee: Wes McKinney

> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218996#comment-16218996
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check 
in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246#issuecomment-339387796
 
 
   Thank you for doing this!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1721:
---

Assignee: Licht Takeuchi  (was: Wes McKinney)

> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218993#comment-16218993
 ] 

ASF GitHub Bot commented on ARROW-1723:
---

JohnPJenkins commented on issue #1244: ARROW-1723: [C++] add ARROW_STATIC to 
mark static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#issuecomment-339387674
 
 
   Reworked the commit based on discussion - Windows builds now use separate 
compilation with a conditional ARROW_STATIC macro for static and shared library 
targets (Unix remains the same).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Windows: __declspec(dllexport) specified when building arrow static library
> ---
>
> Key: ARROW-1723
> URL: https://issues.apache.org/jira/browse/ARROW-1723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: John Jenkins
>  Labels: pull-request-available
>
> As I understand it, dllexport/dllimport should be left out when building and 
> using static libraries on Windows. A PR will follow shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218991#comment-16218991
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

wesm commented on a change in pull request #1246: ARROW-1721: [Python] 
Implement null-mask check in places where it isn't supported in 
numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246#discussion_r146910978
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -1029,6 +1033,44 @@ Status LoopPySequence(PyObject* sequence, T func) {
   return Status::OK();
 }
 
+template <typename T>
+Status LoopPySequenceWithMasks(
+PyObject* sequence,
+const Ndarray1DIndexer<uint8_t>& mask_values,
+bool have_mask,
+T func
+) {
 
 Review comment:
   Can you run clang-format? (`make format` or `ninja format`). This should 
also fix the cpplint failure in CI . See 
https://github.com/apache/arrow/tree/master/cpp#continuous-integration


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1728) [C++] Run clang-format checks in Travis CI

2017-10-25 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1728:
---

 Summary: [C++] Run clang-format checks in Travis CI
 Key: ARROW-1728
 URL: https://issues.apache.org/jira/browse/ARROW-1728
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.8.0


I think it's reasonable to expect contributors to run clang-format on their 
code. This may lead to a higher number of failed builds but will eliminate 
noise diffs in unrelated patches



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1726) [GLib] Add setup description to verify C GLib build

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218978#comment-16218978
 ] 

ASF GitHub Bot commented on ARROW-1726:
---

wesm closed pull request #1247: ARROW-1726: [GLib] Add setup description to 
verify C GLib build
URL: https://github.com/apache/arrow/pull/1247
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/dev/release/VERIFY.md b/dev/release/VERIFY.md
index 3f073e408..5b441ac13 100644
--- a/dev/release/VERIFY.md
+++ b/dev/release/VERIFY.md
@@ -22,4 +22,55 @@
 ## Windows
 
 We've provided a convenience script for verifying the C++ and Python builds on
-Windows. Read the comments in `verify-release-candidate.bat` for instructions
\ No newline at end of file
+Windows. Read the comments in `verify-release-candidate.bat` for instructions.
+
+## Linux and macOS
+
+We've provided a convenience script for verifying the C++, Python, C
+GLib, Java and JavaScript builds on Linux and macOS. Read the comments in
+`verify-release-candidate.sh` for instructions.
+
+### C GLib
+
+You need the following to verify the C GLib build:
+
+  * GLib
+  * GObject Introspection
+  * Ruby (a non-EOL version is required)
+  * gobject-introspection gem
+  * test-unit gem
+
+You can install them with the following commands on Debian GNU/Linux and Ubuntu:
+
+```console
+% sudo apt install -y -V libgirepository1.0-dev ruby-dev
+% sudo gem install gobject-introspection test-unit
+```
+
+You can install them with the following commands on CentOS:
+
+```console
+% sudo yum install -y gobject-introspection-devel
+% git clone https://github.com/sstephenson/rbenv.git ~/.rbenv
+% git clone https://github.com/sstephenson/ruby-build.git 
~/.rbenv/plugins/ruby-build
+% echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bash_profile
+% echo 'eval "$(rbenv init -)"' >> ~/.bash_profile
+% exec ${SHELL} --login
+% sudo yum install -y gcc make patch openssl-devel readline-devel zlib-devel
+% rbenv install 2.4.2
+% rbenv global 2.4.2
+% gem install gobject-introspection test-unit
+```
+
+You can install them with the following commands on macOS:
+
+```console
+% brew install gobject-introspection
+% gem install gobject-introspection test-unit
+```
+
+You need to set `PKG_CONFIG_PATH` to find libffi on macOS:
+
+```console
+% export PKG_CONFIG_PATH=$(brew --prefix libffi)/lib/pkgconfig:$PKG_CONFIG_PATH
+```


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [GLib] Add setup description to verify C GLib build
> ---
>
> Key: ARROW-1726
> URL: https://issues.apache.org/jira/browse/ARROW-1726
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ARROW-1726) [GLib] Add setup description to verify C GLib build

2017-10-25 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1726.
-
Resolution: Fixed

Issue resolved by pull request 1247
[https://github.com/apache/arrow/pull/1247]

> [GLib] Add setup description to verify C GLib build
> ---
>
> Key: ARROW-1726
> URL: https://issues.apache.org/jira/browse/ARROW-1726
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>






[jira] [Commented] (ARROW-473) [C++/Python] Add public API for retrieving block locations for a particular HDFS file

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218927#comment-16218927
 ] 

ASF GitHub Bot commented on ARROW-473:
--

AnkitAggarwalPEC commented on issue #1031: WIP ARROW-473: [C++/Python] Add 
public API for retrieving block locations for a particular HDFS file
URL: https://github.com/apache/arrow/pull/1031#issuecomment-339375539
 
 
   @cpcloud Is there any environment that needs to be set up before this?




> [C++/Python] Add public API for retrieving block locations for a particular 
> HDFS file
> -
>
> Key: ARROW-473
> URL: https://issues.apache.org/jira/browse/ARROW-473
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> This is necessary for applications looking to schedule data-local work. 
> libhdfs does not have APIs to request the block locations directly, so we 
> need to see if the {{hdfsGetHosts}} function will do what we need. For 
> libhdfs3 there is a public API function 





[jira] [Commented] (ARROW-473) [C++/Python] Add public API for retrieving block locations for a particular HDFS file

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218923#comment-16218923
 ] 

ASF GitHub Bot commented on ARROW-473:
--

AnkitAggarwalPEC commented on issue #1031: WIP ARROW-473: [C++/Python] Add 
public API for retrieving block locations for a particular HDFS file
URL: https://github.com/apache/arrow/pull/1031#issuecomment-339375220
 
 
   @cpcloud I've been running the script for the last 10 minutes, but it is still showing the same error:
   
   Could not execute command: select VERSION()
   Starting Impala Shell without Kerberos authentication
   Connected to arrow-hdfs:21000
   Server version: impalad version 2.9.0-cdh5.12.0 RELEASE (build 
03c6ddbdcec39238be4f5b14a300d5c4f576097e)
   Query: select VERSION()
   Query submitted at: 2017-10-25 15:43:59 (Coordinator: 
http://arrow-hdfs:25000)
   ERROR: AnalysisException: This Impala daemon is not ready to accept user 
requests. Status: Waiting for catalog update from the StateStore.
   
   




> [C++/Python] Add public API for retrieving block locations for a particular 
> HDFS file
> -
>
> Key: ARROW-473
> URL: https://issues.apache.org/jira/browse/ARROW-473
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> This is necessary for applications looking to schedule data-local work. 
> libhdfs does not have APIs to request the block locations directly, so we 
> need to see if the {{hdfsGetHosts}} function will do what we need. For 
> libhdfs3 there is a public API function 





[jira] [Created] (ARROW-1727) [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries

2017-10-25 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1727:
---

 Summary: [Format] Expand Arrow streaming format to permit new 
dictionaries and deltas / additions to existing dictionaries
 Key: ARROW-1727
 URL: https://issues.apache.org/jira/browse/ARROW-1727
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney








[jira] [Created] (ARROW-1726) [GLib] Add setup description to verify C GLib build

2017-10-25 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-1726:
---

 Summary: [GLib] Add setup description to verify C GLib build
 Key: ARROW-1726
 URL: https://issues.apache.org/jira/browse/ARROW-1726
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
Priority: Minor
 Fix For: 0.8.0








[jira] [Updated] (ARROW-1726) [GLib] Add setup description to verify C GLib build

2017-10-25 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1726:
--
Labels: pull-request-available  (was: )

> [GLib] Add setup description to verify C GLib build
> ---
>
> Key: ARROW-1726
> URL: https://issues.apache.org/jira/browse/ARROW-1726
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>






[jira] [Commented] (ARROW-1726) [GLib] Add setup description to verify C GLib build

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218708#comment-16218708
 ] 

ASF GitHub Bot commented on ARROW-1726:
---

kou opened a new pull request #1247: ARROW-1726: [GLib] Add setup description 
to verify C GLib build
URL: https://github.com/apache/arrow/pull/1247
 
 
   




> [GLib] Add setup description to verify C GLib build
> ---
>
> Key: ARROW-1726
> URL: https://issues.apache.org/jira/browse/ARROW-1726
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>






[jira] [Updated] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1721:
--
Labels: pull-request-available  (was: )

> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375





[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218681#comment-16218681
 ] 

ASF GitHub Bot commented on ARROW-1721:
---

Licht-T opened a new pull request #1246: ARROW-1721: [Python] Implement 
null-mask check in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246
 
 
   This closes 
[ARROW-1721](https://issues.apache.org/jira/projects/ARROW/issues/ARROW-1721).




> [Python] Support null mask in places where it isn't supported in 
> numpy_to_arrow.cc
> --
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for 
> SPARK-21375





[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218645#comment-16218645
 ] 

ASF GitHub Bot commented on ARROW-1723:
---

wesm commented on issue #1244: ARROW-1723: [C++] add ARROW_STATIC to mark 
static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#issuecomment-33954
 
 
   @MaxRis yeah, I agree on that. If we support other build systems (like Bazel) in the future, it would be better to have the exports explicit.




> Windows: __declspec(dllexport) specified when building arrow static library
> ---
>
> Key: ARROW-1723
> URL: https://issues.apache.org/jira/browse/ARROW-1723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: John Jenkins
>  Labels: pull-request-available
>
> As I understand it, dllexport/dllimport should be left out when building and 
> using static libraries on Windows. A PR will follow shortly.





[jira] [Commented] (ARROW-1209) [C++] Implement converter between Arrow record batches and Avro records

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218641#comment-16218641
 ] 

ASF GitHub Bot commented on ARROW-1209:
---

mariusvniekerk commented on issue #1026: ARROW-1209: [C++] [WIP] Support for 
reading avro from an AvroFileReader
URL: https://github.com/apache/arrow/pull/1026#issuecomment-339332258
 
 
   Yeah, I'll rebase this and see what needs to change. I think we were missing libjansson the last time I touched this.




> [C++] Implement converter between Arrow record batches and Avro records
> ---
>
> Key: ARROW-1209
> URL: https://issues.apache.org/jira/browse/ARROW-1209
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> This would be useful for streaming systems that need to consume or produce 
> Avro in C/C++





[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218627#comment-16218627
 ] 

ASF GitHub Bot commented on ARROW-1723:
---

MaxRis commented on issue #1244: ARROW-1723: [C++] add ARROW_STATIC to mark 
static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#issuecomment-339330749
 
 
   @wesm some [more reading on what it takes to use WINDOWS_EXPORT_ALL_SYMBOLS](https://blog.kitware.com/create-dlls-on-windows-without-declspec-using-new-cmake-export-all-feature/) (at the bottom of the article).
   We might try to use it, but it may not be a good idea to rely on a CMake-only feature and lose compatibility with other build tools on Windows.




> Windows: __declspec(dllexport) specified when building arrow static library
> ---
>
> Key: ARROW-1723
> URL: https://issues.apache.org/jira/browse/ARROW-1723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: John Jenkins
>  Labels: pull-request-available
>
> As I understand it, dllexport/dllimport should be left out when building and 
> using static libraries on Windows. A PR will follow shortly.





[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218613#comment-16218613
 ] 

ASF GitHub Bot commented on ARROW-1723:
---

JohnPJenkins commented on a change in pull request #1244: ARROW-1723: [C++] add 
ARROW_STATIC to mark static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#discussion_r146854730
 
 

 ##
 File path: cpp/cmake_modules/BuildUtils.cmake
 ##
 @@ -165,6 +165,8 @@ function(ADD_ARROW_LIB LIB_NAME)
   LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}"
   OUTPUT_NAME ${LIB_NAME_STATIC})
 
+  target_compile_definitions(${LIB_NAME}_static PUBLIC ARROW_STATIC)
 
 Review comment:
   That makes sense - looking more closely at the CMake file, the Unix builds are unconditionally using PIC, so no issues there.




> Windows: __declspec(dllexport) specified when building arrow static library
> ---
>
> Key: ARROW-1723
> URL: https://issues.apache.org/jira/browse/ARROW-1723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: John Jenkins
>  Labels: pull-request-available
>
> As I understand it, dllexport/dllimport should be left out when building and 
> using static libraries on Windows. A PR will follow shortly.





[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218610#comment-16218610
 ] 

ASF GitHub Bot commented on ARROW-1723:
---

wesm commented on a change in pull request #1244: ARROW-1723: [C++] add 
ARROW_STATIC to mark static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#discussion_r146854212
 
 

 ##
 File path: cpp/cmake_modules/BuildUtils.cmake
 ##
 @@ -165,6 +165,8 @@ function(ADD_ARROW_LIB LIB_NAME)
   LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}"
   OUTPUT_NAME ${LIB_NAME_STATIC})
 
+  target_compile_definitions(${LIB_NAME}_static PUBLIC ARROW_STATIC)
 
 Review comment:
   The objlib thing is an optimization for Unix/macOS, so this part could be skipped on Windows.




> Windows: __declspec(dllexport) specified when building arrow static library
> ---
>
> Key: ARROW-1723
> URL: https://issues.apache.org/jira/browse/ARROW-1723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: John Jenkins
>  Labels: pull-request-available
>
> As I understand it, dllexport/dllimport should be left out when building and 
> using static libraries on Windows. A PR will follow shortly.





[jira] [Commented] (ARROW-1710) [Java] Decide what to do with non-nullable vectors in new vector class hierarchy

2017-10-25 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218606#comment-16218606
 ] 

Wes McKinney commented on ARROW-1710:
-

See https://github.com/apache/arrow/blob/master/format/Layout.md#null-bitmaps: "Arrays having a 0 null count may choose to not allocate the null bitmap." So when there are no nulls, it is not necessary to create a BitVector. It is also not necessary to populate the bit vector, so, as you say, waiting until the first null to create the bitmap might be the way to go.

> [Java] Decide what to do with non-nullable vectors in new vector class 
> hierarchy 
> -
>
> Key: ARROW-1710
> URL: https://issues.apache.org/jira/browse/ARROW-1710
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java - Vectors
>Reporter: Li Jin
> Fix For: 0.8.0
>
>
> So far the consensus seems to be remove all non-nullable vectors. 





[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218601#comment-16218601
 ] 

ASF GitHub Bot commented on ARROW-1723:
---

MaxRis commented on a change in pull request #1244: ARROW-1723: [C++] add 
ARROW_STATIC to mark static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#discussion_r146852361
 
 

 ##
 File path: cpp/cmake_modules/BuildUtils.cmake
 ##
 @@ -165,6 +165,8 @@ function(ADD_ARROW_LIB LIB_NAME)
   LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}"
   OUTPUT_NAME ${LIB_NAME_STATIC})
 
+  target_compile_definitions(${LIB_NAME}_static PUBLIC ARROW_STATIC)
 
 Review comment:
   @JohnPJenkins it seems the logic should be changed only for Windows (so as not to increase compilation time on Unix). On Windows it may make sense to build the static library directly from sources in the CMake script, since object files cannot be reused.




> Windows: __declspec(dllexport) specified when building arrow static library
> ---
>
> Key: ARROW-1723
> URL: https://issues.apache.org/jira/browse/ARROW-1723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: John Jenkins
>  Labels: pull-request-available
>
> As I understand it, dllexport/dllimport should be left out when building and 
> using static libraries on Windows. A PR will follow shortly.





[jira] [Commented] (ARROW-1588) [C++/Format] Harden Decimal Format

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218591#comment-16218591
 ] 

ASF GitHub Bot commented on ARROW-1588:
---

wesm closed pull request #1211: ARROW-1588: [C++/Format] Harden Decimal Format
URL: https://github.com/apache/arrow/pull/1211
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt
index 1178c658c..5df5e748f 100644
--- a/cpp/src/arrow/util/CMakeLists.txt
+++ b/cpp/src/arrow/util/CMakeLists.txt
@@ -42,6 +42,7 @@ install(FILES
   rle-encoding.h
   sse-util.h
   stl.h
+  type_traits.h
   visibility.h
   DESTINATION include/arrow/util)
 
diff --git a/cpp/src/arrow/util/bit-util-test.cc b/cpp/src/arrow/util/bit-util-test.cc
index 5a66d7e85..92bdcb5fc 100644
--- a/cpp/src/arrow/util/bit-util-test.cc
+++ b/cpp/src/arrow/util/bit-util-test.cc
@@ -28,7 +28,6 @@
 
 #include "arrow/buffer.h"
 #include "arrow/memory_pool.h"
-#include "arrow/status.h"
 #include "arrow/test-util.h"
 #include "arrow/util/bit-stream-utils.h"
 #include "arrow/util/bit-util.h"
@@ -334,4 +333,36 @@ TEST(BitStreamUtil, ZigZag) {
   TestZigZag(-std::numeric_limits<int32_t>::max());
 }
 
+TEST(BitUtil, RoundTripLittleEndianTest) {
+  uint64_t value = 0xFF;
+
+#if ARROW_LITTLE_ENDIAN
+  uint64_t expected = value;
+#else
+  uint64_t expected = std::numeric_limits<uint64_t>::max() << 56;
+#endif
+
+  uint64_t little_endian_result = BitUtil::ToLittleEndian(value);
+  ASSERT_EQ(expected, little_endian_result);
+
+  uint64_t from_little_endian = BitUtil::FromLittleEndian(little_endian_result);
+  ASSERT_EQ(value, from_little_endian);
+}
+
+TEST(BitUtil, RoundTripBigEndianTest) {
+  uint64_t value = 0xFF;
+
+#if ARROW_LITTLE_ENDIAN
+  uint64_t expected = std::numeric_limits<uint64_t>::max() << 56;
+#else
+  uint64_t expected = value;
+#endif
+
+  uint64_t big_endian_result = BitUtil::ToBigEndian(value);
+  ASSERT_EQ(expected, big_endian_result);
+
+  uint64_t from_big_endian = BitUtil::FromBigEndian(big_endian_result);
+  ASSERT_EQ(value, from_big_endian);
+}
+
 }  // namespace arrow
diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h
index 2509de21f..8043f90cc 100644
--- a/cpp/src/arrow/util/bit-util.h
+++ b/cpp/src/arrow/util/bit-util.h
@@ -56,6 +56,7 @@
 #include 
 
 #include "arrow/util/macros.h"
+#include "arrow/util/type_traits.h"
 #include "arrow/util/visibility.h"
 
 #ifdef ARROW_USE_SSE
@@ -305,7 +306,7 @@ static inline uint32_t ByteSwap(uint32_t value) {
   return static_cast<uint32_t>(ARROW_BYTE_SWAP32(value));
 }
 static inline int16_t ByteSwap(int16_t value) {
-  constexpr int16_t m = static_cast<int16_t>(0xff);
+  constexpr auto m = static_cast<int16_t>(0xff);
   return static_cast<int16_t>(((value >> 8) & m) | ((value & m) << 8));
 }
 static inline uint16_t ByteSwap(uint16_t value) {
@@ -331,8 +332,8 @@ static inline void ByteSwap(void* dst, const void* src, int len) {
   break;
   }
 
-  uint8_t* d = reinterpret_cast<uint8_t*>(dst);
-  const uint8_t* s = reinterpret_cast<const uint8_t*>(src);
+  auto d = reinterpret_cast<uint8_t*>(dst);
+  auto s = reinterpret_cast<const uint8_t*>(src);
   for (int i = 0; i < len; ++i) {
 d[i] = s[len - i - 1];
   }
@@ -341,36 +342,57 @@ static inline void ByteSwap(void* dst, const void* src, int len) {
 /// Converts to big endian format (if not already in big endian) from the
 /// machine's native endian format.
 #if ARROW_LITTLE_ENDIAN
-static inline int64_t ToBigEndian(int64_t value) { return ByteSwap(value); }
-static inline uint64_t ToBigEndian(uint64_t value) { return ByteSwap(value); }
-static inline int32_t ToBigEndian(int32_t value) { return ByteSwap(value); }
-static inline uint32_t ToBigEndian(uint32_t value) { return ByteSwap(value); }
-static inline int16_t ToBigEndian(int16_t value) { return ByteSwap(value); }
-static inline uint16_t ToBigEndian(uint16_t value) { return ByteSwap(value); }
+template <typename T, typename = internal::EnableIfIsOneOf<T, int64_t, uint64_t, int32_t, uint32_t, int16_t, uint16_t>>
+static inline T ToBigEndian(T value) {
+  return ByteSwap(value);
+}
+
+template <typename T, typename = internal::EnableIfIsOneOf<T, int64_t, uint64_t, int32_t, uint32_t, int16_t, uint16_t>>
+static inline T ToLittleEndian(T value) {
+  return value;
+}
 #else
-static inline int64_t ToBigEndian(int64_t val) { return val; }
-static inline uint64_t ToBigEndian(uint64_t val) { return val; }
-static inline int32_t ToBigEndian(int32_t val) { return val; }
-static inline uint32_t ToBigEndian(uint32_t val) { return val; }
-static inline int16_t ToBigEndian(int16_t val) { return val; }
-static inline uint16_t ToBigEndian(uint16_t val) { return val; }
+template <typename T, typename = internal::EnableIfIsOneOf<T, int64_t, uint64_t, int32_t, uint32_t, int16_t, uint16_t>>
+static inline T ToBigEndian(T value) {
+  return value;
+}
 #endif
 
 /// Converts from big endian format to the machine's native endian format.
 #if ARROW_LITTLE_ENDIAN
-static inline int64_t FromBigEndian(int64_t value) { return ByteSwap(value); }
-static inline uint64_t Fro
