[jira] [Commented] (ARROW-1500) [C++] Result of ftruncate ignored in MemoryMappedFile::Create

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172769#comment-16172769
 ] 

ASF GitHub Bot commented on ARROW-1500:
---

Github user amirma commented on the issue:

https://github.com/apache/arrow/pull/1116
  
@wesm Rebased. 


> [C++] Result of ftruncate ignored in MemoryMappedFile::Create
> -
>
> Key: ARROW-1500
> URL: https://issues.apache.org/jira/browse/ARROW-1500
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Amir Malekpour
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Observed in gcc 5.4.0 release build



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1553) [JAVA] Implement setInitialCapacity for MapWriter and pass on this capacity during lazy creation of child vectors

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172757#comment-16172757
 ] 

ASF GitHub Bot commented on ARROW-1553:
---

Github user siddharthteotia commented on the issue:

https://github.com/apache/arrow/pull/1113
  
Can this be merged?


> [JAVA] Implement setInitialCapacity for MapWriter and pass on this capacity 
> during lazy creation of child vectors
> -
>
> Key: ARROW-1553
> URL: https://issues.apache.org/jira/browse/ARROW-1553
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172749#comment-16172749
 ] 

ASF GitHub Bot commented on ARROW-1557:
---

Github user wesm commented on the issue:

https://github.com/apache/arrow/pull/1117
  
here's a fix to cherry pick 
https://github.com/wesm/arrow/commit/965a560867f45025dcbfe50c572593faa7d7cb33


> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Assignee: Tom Augspurger
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
> {{names}} matches. I think this should raise with a {{ValueError}}:
> {code}
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> {code}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172745#comment-16172745
 ] 

ASF GitHub Bot commented on ARROW-1557:
---

Github user wesm commented on the issue:

https://github.com/apache/arrow/pull/1117
  
Appears there is a test failure that was exposed by this patch, can you fix?


> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Assignee: Tom Augspurger
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
> {{names}} matches. I think this should raise with a {{ValueError}}:
> {code}
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> {code}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1500) [C++] Result of ftruncate ignored in MemoryMappedFile::Create

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172744#comment-16172744
 ] 

ASF GitHub Bot commented on ARROW-1500:
---

Github user wesm commented on the issue:

https://github.com/apache/arrow/pull/1116
  
Can you rebase? Not sure why there's a merge conflict now


> [C++] Result of ftruncate ignored in MemoryMappedFile::Create
> -
>
> Key: ARROW-1500
> URL: https://issues.apache.org/jira/browse/ARROW-1500
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Amir Malekpour
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Observed in gcc 5.4.0 release build



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-269) UnionVector getBuffers method does not include typevector

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-269:
---
Fix Version/s: (was: 1.0.0)
   0.7.0

> UnionVector getBuffers method does not include typevector
> -
>
> Key: ARROW-269
> URL: https://issues.apache.org/jira/browse/ARROW-269
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Steven Phillips
> Fix For: 0.7.0
>
>
> Only the internal MapVector's buffers are returned currently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ARROW-269) UnionVector getBuffers method does not include typevector

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-269.

Resolution: Fixed
  Assignee: Steven Phillips

https://github.com/apache/arrow/commit/ec51d566708f5d6ea0a94a6d53152dc8cc98d6aa

> UnionVector getBuffers method does not include typevector
> -
>
> Key: ARROW-269
> URL: https://issues.apache.org/jira/browse/ARROW-269
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Steven Phillips
>Assignee: Steven Phillips
> Fix For: 0.7.0
>
>
> Only the internal MapVector's buffers are returned currently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-269) UnionVector getBuffers method does not include typevector

2017-09-19 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172700#comment-16172700
 ] 

Li Jin commented on ARROW-269:
--

[~wesmckinn] this is fixed by 
https://github.com/apache/arrow/commit/ec51d566708f5d6ea0a94a6d53152dc8cc98d6aa

> UnionVector getBuffers method does not include typevector
> -
>
> Key: ARROW-269
> URL: https://issues.apache.org/jira/browse/ARROW-269
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Steven Phillips
> Fix For: 1.0.0
>
>
> Only the internal MapVector's buffers are returned currently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ARROW-1554) [Python] Document that pip wheels depend on MSVC14 runtime

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1554.
-
Resolution: Fixed

Issue resolved by pull request 1115
[https://github.com/apache/arrow/pull/1115]

> [Python] Document that pip wheels depend on MSVC14 runtime
> --
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) [Python] Document that pip wheels depend on MSVC14 runtime

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172687#comment-16172687
 ] 

ASF GitHub Bot commented on ARROW-1554:
---

Github user wesm commented on the issue:

https://github.com/apache/arrow/pull/1115
  
+1. The Travis failure appears to be due to a transient apt problem


> [Python] Document that pip wheels depend on MSVC14 runtime
> --
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1192) [JAVA] Improve splitAndTransfer performance for List and Union vectors

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172693#comment-16172693
 ] 

ASF GitHub Bot commented on ARROW-1192:
---

Github user asfgit closed the pull request at:

https://github.com/apache/arrow/pull/819


> [JAVA] Improve splitAndTransfer performance for List and Union vectors
> --
>
> Key: ARROW-1192
> URL: https://issues.apache.org/jira/browse/ARROW-1192
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Steven Phillips
>Assignee: Steven Phillips
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Most vector implementations slice the underlying buffer for splitAndTransfer, 
> but ListVector and UnionVector copy data into a new buffer. We should enhance 
> these to use slice as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) [Python] Document that pip wheels depend on MSVC14 runtime

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172692#comment-16172692
 ] 

ASF GitHub Bot commented on ARROW-1554:
---

Github user asfgit closed the pull request at:

https://github.com/apache/arrow/pull/1115


> [Python] Document that pip wheels depend on MSVC14 runtime
> --
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1557:
---

Assignee: Tom Augspurger

> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Assignee: Tom Augspurger
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
> {{names}} matches. I think this should raise with a {{ValueError}}:
> {code}
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> {code}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1536) [C++] Do not transitively depend on libboost_system

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1536:
---

Assignee: Deepak Majeti

> [C++] Do not transitively depend on libboost_system
> ---
>
> Key: ARROW-1536
> URL: https://issues.apache.org/jira/browse/ARROW-1536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.7.0
>Reporter: Wes McKinney
>Assignee: Deepak Majeti
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> We picked up this dependency recently. I don't think this is a blocker for 
> 0.7.0, but it impacts static linkers (e.g. linkers of parquet-cpp)
> This was introduced in ARROW-1339 
> https://github.com/apache/arrow/commit/94b7cfafae0bda8f68ee3e5e9702c954b0116203
> cc [~mdeepak]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1536) [C++] Do not transitively depend on libboost_system

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172684#comment-16172684
 ] 

ASF GitHub Bot commented on ARROW-1536:
---

Github user asfgit closed the pull request at:

https://github.com/apache/arrow/pull/1105


> [C++] Do not transitively depend on libboost_system
> ---
>
> Key: ARROW-1536
> URL: https://issues.apache.org/jira/browse/ARROW-1536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.7.0
>Reporter: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> We picked up this dependency recently. I don't think this is a blocker for 
> 0.7.0, but it impacts static linkers (e.g. linkers of parquet-cpp)
> This was introduced in ARROW-1339 
> https://github.com/apache/arrow/commit/94b7cfafae0bda8f68ee3e5e9702c954b0116203
> cc [~mdeepak]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ARROW-1536) [C++] Do not transitively depend on libboost_system

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1536.
-
Resolution: Fixed

Issue resolved by pull request 1105
[https://github.com/apache/arrow/pull/1105]

> [C++] Do not transitively depend on libboost_system
> ---
>
> Key: ARROW-1536
> URL: https://issues.apache.org/jira/browse/ARROW-1536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.7.0
>Reporter: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> We picked up this dependency recently. I don't think this is a blocker for 
> 0.7.0, but it impacts static linkers (e.g. linkers of parquet-cpp)
> This was introduced in ARROW-1339 
> https://github.com/apache/arrow/commit/94b7cfafae0bda8f68ee3e5e9702c954b0116203
> cc [~mdeepak]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1577) [JS] Package release script for NPM modules

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1577:
---

 Summary: [JS] Package release script for NPM modules
 Key: ARROW-1577
 URL: https://issues.apache.org/jira/browse/ARROW-1577
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.8.0
Reporter: Wes McKinney


Since the NPM JavaScript module may wish to release more frequently than the 
main Arrow "monorepo", we should create a script to produce signed NPM 
artifacts to use for voting:

* Update metadata for new version
* Run unit tests
* Create package tarballs with NPM
* GPG sign and create md5 and sha512 checksum files
* Upload to Apache dev SVN

i.e. like https://github.com/apache/arrow/blob/master/dev/release/02-source.sh, 
but only for JavaScript.

We will also want to write instructions for Arrow developers to verify the 
tarballs to streamline the release votes



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1548) [GLib] Support build append in builder

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172629#comment-16172629
 ] 

ASF GitHub Bot commented on ARROW-1548:
---

Github user kou commented on the issue:

https://github.com/apache/arrow/pull/1110
  
Emacs helps me a lot. :)


> [GLib] Support build append in builder
> --
>
> Key: ARROW-1548
> URL: https://issues.apache.org/jira/browse/ARROW-1548
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> It improves performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1209) [C++] Implement converter between Arrow record batches and Avro records

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172621#comment-16172621
 ] 

ASF GitHub Bot commented on ARROW-1209:
---

Github user wesm commented on the issue:

https://github.com/apache/arrow/pull/1026
  
Hm, yeah I'm looking at avro-c and it's not very Windows-friendly. We can 
use FILE* on Windows in Arrow but that won't work on files over 2GB. But maybe 
that's OK. 


> [C++] Implement converter between Arrow record batches and Avro records
> ---
>
> Key: ARROW-1209
> URL: https://issues.apache.org/jira/browse/ARROW-1209
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: pull-request-available
>
> This would be useful for streaming systems that need to consume or produce 
> Avro in C/C++



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1209) [C++] Implement converter between Arrow record batches and Avro records

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172610#comment-16172610
 ] 

ASF GitHub Bot commented on ARROW-1209:
---

Github user mariusvniekerk commented on the issue:

https://github.com/apache/arrow/pull/1026
  
cyavro provides support for python file-like objects by basically making a 
void* and using fmemopen on it to get the FILE*



> [C++] Implement converter between Arrow record batches and Avro records
> ---
>
> Key: ARROW-1209
> URL: https://issues.apache.org/jira/browse/ARROW-1209
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: pull-request-available
>
> This would be useful for streaming systems that need to consume or produce 
> Avro in C/C++



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1209) [C++] Implement converter between Arrow record batches and Avro records

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172609#comment-16172609
 ] 

ASF GitHub Bot commented on ARROW-1209:
---

Github user mariusvniekerk commented on the issue:

https://github.com/apache/arrow/pull/1026
  
Yeah the implementation in impala seems to provide its own codecs.  The cpp 
implementation in libavro-cpp doesn't support all the codecs yet so I can see 
why impala/kudu reimplemented these.  

I assume that the impala cpp implementation is too tied to LLVM to be 
easily moved upstream to avro-cpp itself?


> [C++] Implement converter between Arrow record batches and Avro records
> ---
>
> Key: ARROW-1209
> URL: https://issues.apache.org/jira/browse/ARROW-1209
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: pull-request-available
>
> This would be useful for streaming systems that need to consume or produce 
> Avro in C/C++



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1557:
--
Labels: pull-request-available  (was: )

> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
> {{names}} matches. I think this should raise with a {{ValueError}}:
> {code}
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> {code}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172608#comment-16172608
 ] 

ASF GitHub Bot commented on ARROW-1557:
---

GitHub user TomAugspurger opened a pull request:

https://github.com/apache/arrow/pull/1117

ARROW-1557 [Python] Validate names length in Table.from_arrays

We now raise a ValueError when the length of the names doesn't match
the length of the arrays.

```python
In [1]: import pyarrow as pa

In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
names=['a', 'b', 'c'])
---
ValueError                                Traceback (most recent call last)
 in ()
> 1 pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
names=['a', 'b', 'c'])

table.pxi in pyarrow.lib.Table.from_arrays()

table.pxi in pyarrow.lib._schema_from_arrays()

ValueError: Length of names (3) does not match length of arrays (2)
```

This affected `RecordBatch.from_arrays` and `Table.from_arrays`.
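
For reference, a minimal sketch of the kind of check involved (the helper name below is made up; the actual validation lives in `table.pxi` / `_schema_from_arrays`, as the traceback above shows):

```python
def _validate_names(arrays, names):
    # Hypothetical helper: fail early when the two lengths disagree.
    if names is not None and len(names) != len(arrays):
        raise ValueError(
            'Length of names ({}) does not match length of arrays ({})'
            .format(len(names), len(arrays)))
```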



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/TomAugspurger/arrow validate-names

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/arrow/pull/1117.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1117


commit ed74d52249fabde739cf0599be0210c818b5d272
Author: Tom Augspurger 
Date:   2017-09-20T01:44:44Z

ARROW-1557 [Python] Validate names length in Table.from_arrays

We now raise a ValueError when the length of the names doesn't match
the length of the arrays.




> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
> {{names}} matches. I think this should raise with a {{ValueError}}:
> {code}
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> {code}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1576) [Python] Add utility functions (or a richer type hierarchy) for checking whether data type instances are members of various type classes

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172571#comment-16172571
 ] 

Wes McKinney commented on ARROW-1576:
-

cf https://github.com/mapd/pymapd/pull/50#discussion_r139854270

> [Python] Add utility functions (or a richer type hierarchy) for checking 
> whether data type instances are members of various type classes
> ---
>
> Key: ARROW-1576
> URL: https://issues.apache.org/jira/browse/ARROW-1576
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>
> E.g. {{is_integer}}, {{is_unsigned_integer}}. This could be implemented 
> similarly to NumPy ({{isinstance(t, pa.FloatingPoint)}} or something)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1576) [Python] Add utility functions (or a richer type hierarchy) for checking whether data type instances are members of various type classes

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1576:
---

 Summary: [Python] Add utility functions (or a richer type 
hierarchy) for checking whether data type instances are members of various type 
classes
 Key: ARROW-1576
 URL: https://issues.apache.org/jira/browse/ARROW-1576
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.8.0


E.g. {{is_integer}}, {{is_unsigned_integer}}. This could be implemented 
similarly to NumPy ({{isinstance(t, pa.FloatingPoint)}} or something)
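
A rough sketch of what such helpers could look like on top of the pyarrow 0.7 API (the function names are the proposed ones, not an existing pyarrow API):

{code}
import pyarrow as pa

_INTEGER_TYPES = (pa.int8(), pa.int16(), pa.int32(), pa.int64(),
                  pa.uint8(), pa.uint16(), pa.uint32(), pa.uint64())
_UNSIGNED_TYPES = (pa.uint8(), pa.uint16(), pa.uint32(), pa.uint64())

def is_integer(t):
    # Compare against concrete type instances; a richer type hierarchy
    # would let us use isinstance() instead.
    return any(t == u for u in _INTEGER_TYPES)

def is_unsigned_integer(t):
    return any(t == u for u in _UNSIGNED_TYPES)

print(is_integer(pa.int32()))           # True
print(is_unsigned_integer(pa.int32()))  # False
{code}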



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1575) [Python] Add pyarrow.column factory function

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1575:
---

 Summary: [Python] Add pyarrow.column factory function
 Key: ARROW-1575
 URL: https://issues.apache.org/jira/browse/ARROW-1575
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.8.0


This would internally call {{Column.from_array}} as appropriate



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1500) [C++] Result of ftruncate ignored in MemoryMappedFile::Create

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172525#comment-16172525
 ] 

ASF GitHub Bot commented on ARROW-1500:
---

Github user amirma commented on the issue:

https://github.com/apache/arrow/pull/1116
  
@wesm Bah, I just noticed my patch has a bug; if truncate fails we will 
leak the file handle. I just resubmitted a fixed version. Thanks.


> [C++] Result of ftruncate ignored in MemoryMappedFile::Create
> -
>
> Key: ARROW-1500
> URL: https://issues.apache.org/jira/browse/ARROW-1500
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Amir Malekpour
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Observed in gcc 5.4.0 release build



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1574) [C++] Implement kernel function that converts a dense array to dictionary given known dictionary

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1574:
---

 Summary: [C++] Implement kernel function that converts a dense 
array to dictionary given known dictionary
 Key: ARROW-1574
 URL: https://issues.apache.org/jira/browse/ARROW-1574
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This may simply be a special case of cast using a dictionary type



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1573) [C++] Implement stateful kernel function that uses DictionaryBuilder to compute dictionary indices

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1573:
---

 Summary: [C++] Implement stateful kernel function that uses 
DictionaryBuilder to compute dictionary indices
 Key: ARROW-1573
 URL: https://issues.apache.org/jira/browse/ARROW-1573
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


An operator utilizing this kernel may need some way to indicate to 
multithreaded schedulers that it cannot be parallelized on chunked arrays 
(unless we implement a concurrent hash table)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1572) [C++] Implement "value counts" kernels for tabulating value frequencies

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1572:
---

 Summary: [C++] Implement "value counts" kernels for tabulating 
value frequencies
 Key: ARROW-1572
 URL: https://issues.apache.org/jira/browse/ARROW-1572
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This is related to "match", "isin", and "unique" since hashing is generally 
required
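
For intuition, a plain-Python analogue of the tabulation (the Arrow kernel would do this with its own hash table over array values):

{code}
from collections import Counter

# Hash-based tabulation of value frequencies, the same idea that
# "match", "isin", and "unique" rely on.
values = ['a', 'b', 'a', None, 'a']
print(Counter(v for v in values if v is not None))  # Counter({'a': 3, 'b': 1})
{code}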



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1571) [C++] Implement argsort kernels (sort indices) for integers using O(n) counting sort

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1571:
---

 Summary: [C++] Implement argsort kernels (sort indices) for 
integers using O(n) counting sort
 Key: ARROW-1571
 URL: https://issues.apache.org/jira/browse/ARROW-1571
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This function requires knowledge of the minimum and maximum of an array. If that 
range is small enough, then an array of size {{maximum - minimum}} can be constructed 
and used to tabulate value frequencies and then compute the sort indices (this 
is called "grade up" or "grade down" in APL languages). There is generally a 
cross-over point where this function performs worse than mergesort or quicksort 
due to data locality issues
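
A pure-Python sketch of the counting-sort variant described above (illustrative only; the real kernel would operate on Arrow buffers):

{code}
def counting_argsort(values):
    # Stable O(n + k) "grade up" for integers, where k = max - min.
    lo, hi = min(values), max(values)
    counts = [0] * (hi - lo + 2)
    for v in values:
        counts[v - lo + 1] += 1
    for i in range(1, len(counts)):   # prefix sums: first output slot per value
        counts[i] += counts[i - 1]
    indices = [0] * len(values)
    for i, v in enumerate(values):
        indices[counts[v - lo]] = i
        counts[v - lo] += 1
    return indices

print(counting_argsort([3, 1, 2, 1]))  # [1, 3, 2, 0]
{code}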



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1564) [C++] Kernel functions for computing minimum and maximum of an array in one pass

2017-09-19 Thread Amir Malekpour (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Malekpour updated ARROW-1564:
--
Description: This is useful for determining whether a small-range integer 
O( n ) sort can be used in some circumstances. Can also be used for simply 
computing array statistics  (was: This is useful for determining whether a 
small-range integer O( n ) sort can be used in some circumstances. Can also be 
use for simply computing array statistics)

> [C++] Kernel functions for computing minimum and maximum of an array in one 
> pass
> 
>
> Key: ARROW-1564
> URL: https://issues.apache.org/jira/browse/ARROW-1564
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
>
> This is useful for determining whether a small-range integer O( n ) sort can 
> be used in some circumstances. Can also be used for simply computing array 
> statistics



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1570) [C++] Define API for creating a kernel instance from function of scalar input and output with a particular signature

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1570:
---

 Summary: [C++] Define API for creating a kernel instance from 
function of scalar input and output with a particular signature
 Key: ARROW-1570
 URL: https://issues.apache.org/jira/browse/ARROW-1570
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This could include an {{std::function}} instance (but these cannot be inlined 
by the C++ compiler), but should also permit use with inline-able functions or 
functors



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-772) [C++] Implement take kernel functions

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-772:
---
Summary: [C++] Implement take kernel functions  (was: [C++] Implement Take 
function for arrow::Array types)

> [C++] Implement take kernel functions
> -
>
> Key: ARROW-772
> URL: https://issues.apache.org/jira/browse/ARROW-772
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
> Fix For: 0.8.0
>
>
> Among other things, this can be used to convert from DictionaryArray back to 
> dense array. This is equivalent to {{ndarray.take}} in NumPy



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1569) [C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1569:
---

 Summary: [C++] Kernel functions for determining monotonicity 
(ascending or descending) for well-ordered types
 Key: ARROW-1569
 URL: https://issues.apache.org/jira/browse/ARROW-1569
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


These kernels must offer some stateful variant so that monotonicity can be 
determined across chunked arrays
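
A minimal sketch of the stateful idea, carrying the last value of the previous chunk across chunk boundaries (the class name is hypothetical):

{code}
class AscendingChecker:
    def __init__(self):
        self.prev = None
        self.ascending = True

    def update(self, chunk):
        # Feed one chunk at a time; self.prev carries state across chunks.
        for v in chunk:
            if self.prev is not None and v < self.prev:
                self.ascending = False
            self.prev = v
        return self.ascending

c = AscendingChecker()
c.update([1, 2, 3])
print(c.update([3, 5]))  # True so far
print(c.update([4]))     # False: 4 < 5 across the chunk boundary
{code}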



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1568) [C++] Implement "drop null" kernels that return array without nulls

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1568:
---

 Summary: [C++] Implement "drop null" kernels that return array 
without nulls
 Key: ARROW-1568
 URL: https://issues.apache.org/jira/browse/ARROW-1568
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1567) [C++] Implement "fill null" kernels that replace null values with some scalar replacement value

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1567:
---

 Summary: [C++] Implement "fill null" kernels that replace null 
values with some scalar replacement value
 Key: ARROW-1567
 URL: https://issues.apache.org/jira/browse/ARROW-1567
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1566) [C++] Implement "argsort" kernels that use mergesort to compute sorting indices

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1566:
---

 Summary: [C++] Implement "argsort" kernels that use mergesort to 
compute sorting indices
 Key: ARROW-1566
 URL: https://issues.apache.org/jira/browse/ARROW-1566
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1565) [C++] "argtopk" and "argbottomk" functions for computing indices of largest or smallest elements

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1565:
---

 Summary: [C++] "argtopk" and "argbottomk" functions for computing 
indices of largest or smallest elements
 Key: ARROW-1565
 URL: https://issues.apache.org/jira/browse/ARROW-1565
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


Heap-based topk can compute these indices in O(n log k) time
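
A small Python sketch of the heap-based approach (illustrative only):

{code}
import heapq

def argtopk(values, k):
    # O(n log k): keep a bounded heap of (value, index) pairs.
    return [i for _, i in heapq.nlargest(k, ((v, i) for i, v in enumerate(values)))]

def argbottomk(values, k):
    return [i for _, i in heapq.nsmallest(k, ((v, i) for i, v in enumerate(values)))]

print(argtopk([5, 1, 9, 3], 2))     # [2, 0]
print(argbottomk([5, 1, 9, 3], 2))  # [1, 3]
{code}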



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1564) [C++] Kernel functions for computing minimum and maximum of an array in one pass

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1564:

Description: This is useful for determining whether a small-range integer 
O(n) sort can be used in some circumstances. Can also be use for simply 
computing array statistics

> [C++] Kernel functions for computing minimum and maximum of an array in one 
> pass
> 
>
> Key: ARROW-1564
> URL: https://issues.apache.org/jira/browse/ARROW-1564
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
>
> This is useful for determining whether a small-range integer O(n) sort can be 
> used in some circumstances. Can also be use for simply computing array 
> statistics



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1564) [C++] Kernel functions for computing minimum and maximum of an array in one pass

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1564:

Description: This is useful for determining whether a small-range integer 
O( n ) sort can be used in some circumstances. Can also be use for simply 
computing array statistics  (was: This is useful for determining whether a 
small-range integer O(n) sort can be used in some circumstances. Can also be 
use for simply computing array statistics)

> [C++] Kernel functions for computing minimum and maximum of an array in one 
> pass
> 
>
> Key: ARROW-1564
> URL: https://issues.apache.org/jira/browse/ARROW-1564
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
>
> This is useful for determining whether a small-range integer O( n ) sort can 
> be used in some circumstances. Can also be use for simply computing array 
> statistics



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1564) [C++] Kernel functions for computing minimum and maximum of an array in one pass

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1564:
---

 Summary: [C++] Kernel functions for computing minimum and maximum 
of an array in one pass
 Key: ARROW-1564
 URL: https://issues.apache.org/jira/browse/ARROW-1564
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1500) [C++] Result of ftruncate ignored in MemoryMappedFile::Create

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172433#comment-16172433
 ] 

ASF GitHub Bot commented on ARROW-1500:
---

Github user wesm commented on the issue:

https://github.com/apache/arrow/pull/1116
  
Thanks, the Travis CI tubes are a bit clogged today so I may not be able to 
merge until later tonight or tomorrow morning


> [C++] Result of ftruncate ignored in MemoryMappedFile::Create
> -
>
> Key: ARROW-1500
> URL: https://issues.apache.org/jira/browse/ARROW-1500
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Amir Malekpour
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Observed in gcc 5.4.0 release build



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1500) [C++] Result of ftruncate ignored in MemoryMappedFile::Create

2017-09-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1500:
--
Labels: pull-request-available  (was: )

> [C++] Result of ftruncate ignored in MemoryMappedFile::Create
> -
>
> Key: ARROW-1500
> URL: https://issues.apache.org/jira/browse/ARROW-1500
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Amir Malekpour
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Observed in gcc 5.4.0 release build



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1563) [C++] Implement logical unary and binary kernels for boolean arrays

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1563:
---

 Summary: [C++] Implement logical unary and binary kernels for 
boolean arrays
 Key: ARROW-1563
 URL: https://issues.apache.org/jira/browse/ARROW-1563
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


And, or, not (negate), xor



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1562) [C++] Numeric kernel implementations for add (+)

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1562:
---

 Summary: [C++] Numeric kernel implementations for add (+)
 Key: ARROW-1562
 URL: https://issues.apache.org/jira/browse/ARROW-1562
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This function should respect consistent type promotions between types of 
different sizes and between signed and unsigned integers
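
As one point of reference (not necessarily the rules Arrow will adopt), NumPy's promotion behaves like this:

{code}
import numpy as np

print(np.promote_types('int32', 'int64'))   # int64
print(np.promote_types('int32', 'uint32'))  # int64: widened to hold both
print(np.promote_types('int64', 'uint64'))  # float64 under NumPy's rules
{code}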



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1561) [C++] Kernel implementations for "isin" (set containment)

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1561:
---

 Summary: [C++] Kernel implementations for "isin" (set containment)
 Key: ARROW-1561
 URL: https://issues.apache.org/jira/browse/ARROW-1561
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


isin determines whether each element in the left array is contained in the 
values in the right array. This function must handle the case where the right 
array has nulls (so that null in the left array will return true)

{code}
isin(['a', 'b', null], ['a', 'c'])
returns [true, false, null]

isin(['a', 'b', null], ['a', 'c', null])
returns [true, false, true]
{code}

May need an option to return false for null instead of null
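
A plain-Python reference of the semantics above, with None standing in for null (not the Arrow kernel itself):

{code}
def isin(left, right):
    right_set = set(v for v in right if v is not None)
    right_has_null = any(v is None for v in right)
    out = []
    for v in left:
        if v is None:
            # null matches only if the right side also contains null
            out.append(True if right_has_null else None)
        else:
            out.append(v in right_set)
    return out

print(isin(['a', 'b', None], ['a', 'c']))        # [True, False, None]
print(isin(['a', 'b', None], ['a', 'c', None]))  # [True, False, True]
{code}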



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1560) [C++] Kernel implementations for "match" function

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1560:
---

 Summary: [C++] Kernel implementations for "match" function
 Key: ARROW-1560
 URL: https://issues.apache.org/jira/browse/ARROW-1560
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


Match computes a position index array from an array of values into a set of 
categories

{code}
match(['a', 'b', 'a', null, 'b', 'a', 'b'], ['b', 'a'])

returns [1, 0, 1, null, 0, 1, 0]
{code}
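
A plain-Python reference of the same semantics, with None standing in for null (not the Arrow kernel itself):

{code}
def match(values, categories):
    positions = {v: i for i, v in enumerate(categories)}
    return [None if v is None else positions.get(v) for v in values]

print(match(['a', 'b', 'a', None, 'b', 'a', 'b'], ['b', 'a']))
# [1, 0, 1, None, 0, 1, 0]
{code}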



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1559:
---

 Summary: [C++] Kernel implementations for "unique" (compute 
distinct elements of array)
 Key: ARROW-1559
 URL: https://issues.apache.org/jira/browse/ARROW-1559
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1558) [C++] Implement boolean selection kernels

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1558:
---

 Summary: [C++] Implement boolean selection kernels
 Key: ARROW-1558
 URL: https://issues.apache.org/jira/browse/ARROW-1558
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


Select values where a boolean selection array is true. If any values in the 
selection array are null, then the corresponding values in the output array 
should be null
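
One reading of that null rule, as a plain-Python sketch with None standing in for null:

{code}
def select(values, mask):
    # Keep values where the mask is True; a null mask entry yields a null output.
    return [None if m is None else v
            for v, m in zip(values, mask) if m is None or m]

print(select([1, 2, 3, 4], [True, False, None, True]))  # [1, None, 4]
{code}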



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1553) [JAVA] Implement setInitialCapacity for MapWriter and pass on this capacity during lazy creation of child vectors

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172344#comment-16172344
 ] 

ASF GitHub Bot commented on ARROW-1553:
---

Github user siddharthteotia commented on the issue:

https://github.com/apache/arrow/pull/1113
  
Added unit test


> [JAVA] Implement setInitialCapacity for MapWriter and pass on this capacity 
> during lazy creation of child vectors
> -
>
> Key: ARROW-1553
> URL: https://issues.apache.org/jira/browse/ARROW-1553
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1534) [C++] Decimal128::ToBytes and uint8_t* constructor should return/assume big-endian byte order

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172305#comment-16172305
 ] 

ASF GitHub Bot commented on ARROW-1534:
---

Github user cpcloud commented on the issue:

https://github.com/apache/arrow/pull/1108
  
Closing until we resolve the way forward with parquet-cpp and decimals.


> [C++] Decimal128::ToBytes and uint8_t* constructor should return/assume 
> big-endian byte order
> -
>
> Key: ARROW-1534
> URL: https://issues.apache.org/jira/browse/ARROW-1534
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.6.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1534) [C++] Decimal128::ToBytes and uint8_t* constructor should return/assume big-endian byte order

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172307#comment-16172307
 ] 

ASF GitHub Bot commented on ARROW-1534:
---

Github user cpcloud closed the pull request at:

https://github.com/apache/arrow/pull/1108


> [C++] Decimal128::ToBytes and uint8_t* constructor should return/assume 
> big-endian byte order
> -
>
> Key: ARROW-1534
> URL: https://issues.apache.org/jira/browse/ARROW-1534
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.6.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1500) [C++] Result of ftruncate ignored in MemoryMappedFile::Create

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1500:
---

Assignee: Amir Malekpour

> [C++] Result of ftruncate ignored in MemoryMappedFile::Create
> -
>
> Key: ARROW-1500
> URL: https://issues.apache.org/jira/browse/ARROW-1500
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Amir Malekpour
> Fix For: 0.8.0
>
>
> Observed in gcc 5.4.0 release build



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1555) [Python] write_to_dataset on s3

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1555:

Summary: [Python] write_to_dataset on s3  (was: PyArrow write_to_dataset on 
s3)

> [Python] write_to_dataset on s3
> ---
>
> Key: ARROW-1555
> URL: https://issues.apache.org/jira/browse/ARROW-1555
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Young-Jun Ko
>Assignee: Florian Jetter
>Priority: Trivial
> Fix For: 0.8.0
>
>
> When writing an Arrow table to s3, I get a NotImplemented exception.
> The root cause is in _ensure_filesystem and can be reproduced as follows:
> import pyarrow
> import pyarrow.parquet as pqa
> import s3fs
> s3 = s3fs.S3FileSystem()
> pqa._ensure_filesystem(s3).exists("anything")
> It appears that the S3FSWrapper that is instantiated in _ensure_filesystem 
> does not expose the exists method of s3.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1555) PyArrow write_to_dataset on s3

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172142#comment-16172142
 ] 

Wes McKinney commented on ARROW-1555:
-

{{exists}} is a hard one. It may be better to try to fix the implementation of 
{{write_to_dataset}} to not use methods like {{exists}} that are not S3-friendly

https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L920

> PyArrow write_to_dataset on s3
> --
>
> Key: ARROW-1555
> URL: https://issues.apache.org/jira/browse/ARROW-1555
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Young-Jun Ko
>Assignee: Florian Jetter
>Priority: Trivial
> Fix For: 0.8.0
>
>
> When writing an Arrow table to s3, I get a NotImplemented exception.
> The root cause is in _ensure_filesystem and can be reproduced as follows:
> import pyarrow
> import pyarrow.parquet as pqa
> import s3fs
> s3 = s3fs.S3FileSystem()
> pqa._ensure_filesystem(s3).exists("anything")
> It appears that the S3FSWrapper that is instantiated in _ensure_filesystem 
> does not expose the exists method of s3.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1555) PyArrow write_to_dataset on s3

2017-09-19 Thread Florian Jetter (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172076#comment-16172076
 ] 

Florian Jetter commented on ARROW-1555:
---

[~wesmckinn] Yes, it seems like some abstract methods of the FileSystem class 
(exists, open, etc.)  were not implemented in the wrapper. I'll take care of it
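
The kind of delegation that is missing, as a rough sketch (it follows the class and method names from the description above; this is not the actual patch):

{code}
class S3FSWrapper(object):
    def __init__(self, fs):
        self.fs = fs  # an s3fs.S3FileSystem instance

    def exists(self, path):
        # Delegate to the wrapped s3fs filesystem instead of falling back
        # to the unimplemented abstract method.
        return self.fs.exists(path)

    def open(self, path, mode='rb'):
        return self.fs.open(path, mode=mode)
{code}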

> PyArrow write_to_dataset on s3
> --
>
> Key: ARROW-1555
> URL: https://issues.apache.org/jira/browse/ARROW-1555
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Young-Jun Ko
>Priority: Trivial
> Fix For: 0.8.0
>
>
> When writing an Arrow table to s3, I get a NotImplemented exception.
> The root cause is in _ensure_filesystem and can be reproduced as follows:
> import pyarrow
> import pyarrow.parquet as pqa
> import s3fs
> s3 = s3fs.S3FileSystem()
> pqa._ensure_filesystem(s3).exists("anything")
> It appears that the S3FSWrapper that is instantiated in _ensure_filesystem 
> does not expose the exists method of s3.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1555) PyArrow write_to_dataset on s3

2017-09-19 Thread Florian Jetter (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Jetter reassigned ARROW-1555:
-

Assignee: Florian Jetter

> PyArrow write_to_dataset on s3
> --
>
> Key: ARROW-1555
> URL: https://issues.apache.org/jira/browse/ARROW-1555
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Young-Jun Ko
>Assignee: Florian Jetter
>Priority: Trivial
> Fix For: 0.8.0
>
>
> When writing an Arrow table to s3, I get a NotImplemented exception.
> The root cause is in _ensure_filesystem and can be reproduced as follows:
> import pyarrow
> import pyarrow.parquet as pqa
> import s3fs
> s3 = s3fs.S3FileSystem()
> pqa._ensure_filesystem(s3).exists("anything")
> It appears that the S3FSWrapper that is instantiated in _ensure_filesystem 
> does not expose the exists method of s3.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1554) [Python] Document that pip wheels depend on MSVC14 runtime

2017-09-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1554:
--
Labels: pull-request-available  (was: )

> [Python] Document that pip wheels depend on MSVC14 runtime
> --
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
>  Labels: pull-request-available
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1554) [Python] Document that pip wheels depend on MSVC14 runtime

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1554:
---

Assignee: Wes McKinney

> [Python] Document that pip wheels depend on MSVC14 runtime
> --
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) [Python] Document that pip wheels depend on MSVC14 runtime

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172053#comment-16172053
 ] 

ASF GitHub Bot commented on ARROW-1554:
---

GitHub user wesm opened a pull request:

https://github.com/apache/arrow/pull/1115

ARROW-1554: [Python] Update Sphinx install page to note that VC14 runtime 
may need to be installed on Windows



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wesm/arrow ARROW-1554

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/arrow/pull/1115.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1115


commit a7c3e2795b5dc326d15b06b483283afa29a03ed7
Author: Wes McKinney 
Date:   2017-09-19T17:29:01Z

Update Sphinx install page to note that VC14 runtime may need to be 
installed separately when using pip on Windows

Change-Id: I3d0ba98091d5d59a81f528a07740bcc405848287




> [Python] Document that pip wheels depend on MSVC14 runtime
> --
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
>  Labels: pull-request-available
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1557:

Description: 
pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
{{names}} matches. I think this should raise with a {{ValueError}}:

{code}
In [1]: import pyarrow as pa

In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a', 
'b', 'c'])
Out[2]:
pyarrow.Table
a: int64
b: int64

In [3]: pa.__version__
Out[3]: '0.7.0'
{code}

(This is my first time using JIRA, hopefully I didn't mess up too badly)

  was:
pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
{{names}} matches. I think this should raise with a {{ValueError}}:

{{
In [1]: import pyarrow as pa

In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a', 
'b', 'c'])
Out[2]:
pyarrow.Table
a: int64
b: int64

In [3]: pa.__version__
Out[3]: '0.7.0'
}}

(This is my first time using JIRA, hopefully I didn't mess up too badly)


> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Priority: Minor
> Fix For: 0.8.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
> {{names}} matches. I think this should raise with a {{ValueError}}:
> {code}
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> {code}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172019#comment-16172019
 ] 

Wes McKinney commented on ARROW-1557:
-

Agreed! thanks for the bug report

> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Priority: Minor
> Fix For: 0.8.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
> {{names}} matches. I think this should raise with a {{ValueError}}:
> {code}
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> {code}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1557:

Fix Version/s: 0.8.0

> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Priority: Minor
> Fix For: 0.8.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
> {{names}} matches. I think this should raise with a {{ValueError}}:
> {{
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> }}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Tom Augspurger (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172007#comment-16172007
 ] 

Tom Augspurger commented on ARROW-1557:
---

I can probably submit a fix on Thursday or Friday.
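(A rough sketch of the kind of check such a fix could add; the helper name and its placement are hypothetical, and the real change belongs inside pyarrow's Table construction code.)

{code}
def _check_names(arrays, names):
    # Hypothetical validation: reject mismatched lengths up front instead of
    # silently ignoring the extra names.
    if names is not None and len(names) != len(arrays):
        raise ValueError(
            'Length of names ({0}) does not match length of arrays ({1})'
            .format(len(names), len(arrays)))
{code}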

> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Priority: Minor
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
> {{names}} matches. I think this should raise with a {{ValueError}}:
> {{
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> }}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Tom Augspurger (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Augspurger updated ARROW-1557:
--
Summary: [PYTHON] pyarrow.Table.from_arrays doesn't validate names length  
(was: pyarrow.Table.from_arrays doesn't validate names length)

> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Priority: Minor
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
> {{names}} matches. I think this should raise with a {{ValueError}}:
> {{
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> }}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1557) pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1557:
-

 Summary: pyarrow.Table.from_arrays doesn't validate names length
 Key: ARROW-1557
 URL: https://issues.apache.org/jira/browse/ARROW-1557
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor


pa.Table.from_arrays doesn't validate that the length of {{arrays}} and 
{{names}} matches. I think this should raise with a {{ValueError}}:

{{
In [1]: import pyarrow as pa

In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a', 
'b', 'c'])
Out[2]:
pyarrow.Table
a: int64
b: int64

In [3]: pa.__version__
Out[3]: '0.7.0'
}}

(This is my first time using JIRA, hopefully I didn't mess up too badly)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1556) [C++] Incorporate AssertArraysEqual function from PARQUET-1100 patch

2017-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1556:
---

 Summary: [C++] Incorporate AssertArraysEqual function from 
PARQUET-1100 patch
 Key: ARROW-1556
 URL: https://issues.apache.org/jira/browse/ARROW-1556
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.8.0


see discussion in https://github.com/apache/parquet-cpp/pull/398
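(The ticket concerns a C++ test utility; purely as an illustration of the idea, a comparable helper written against pyarrow's Python API might look like the sketch below. The helper name is made up.)

{code}
import pyarrow as pa

def assert_arrays_equal(expected, actual):
    # Illustrative only: fail with both values printed so mismatches are easy
    # to diagnose, similar in spirit to an AssertArraysEqual test helper.
    if not expected.equals(actual):
        raise AssertionError(
            'Arrays differ:\nexpected: {0}\nactual: {1}'.format(expected, actual))

assert_arrays_equal(pa.array([1, 2, 3]), pa.array([1, 2, 3]))
{code}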



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1347) [JAVA] List null type should use consistent name for inner field

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171974#comment-16171974
 ] 

ASF GitHub Bot commented on ARROW-1347:
---

Github user wesm commented on the issue:

https://github.com/apache/arrow/pull/959
  
Can this be merged?


> [JAVA] List null type should use consistent name for inner field
> 
>
> Key: ARROW-1347
> URL: https://issues.apache.org/jira/browse/ARROW-1347
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Steven Phillips
>Assignee: Steven Phillips
>  Labels: pull-request-available
>
> The child field for List type has the field name "$data$" in most cases. In 
> the case that there is not a known type for the List, currently the 
> getField() method will return a subfield with name "DEFAULT". We should make 
> this consistent with the rest of the cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1192) [JAVA] Improve splitAndTransfer performance for List and Union vectors

2017-09-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1192:
--
Labels: pull-request-available  (was: )

> [JAVA] Improve splitAndTransfer performance for List and Union vectors
> --
>
> Key: ARROW-1192
> URL: https://issues.apache.org/jira/browse/ARROW-1192
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Steven Phillips
>Assignee: Steven Phillips
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Most vector implementations slice the underlying buffer for splitAndTransfer, 
> but ListVector and UnionVector copy data into a new buffer. We should enhance 
> these to use slice as well.
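(The fix itself lives in the Java vectors; as an analogy only, pyarrow's Python API shows the slice-instead-of-copy behaviour the ticket asks for: {{Array.slice}} returns a view over the same buffer rather than copying.)

{code}
import pyarrow as pa

arr = pa.array([1, 2, 3, 4, 5])
# slice() produces a zero-copy view over the existing buffer, which is the
# behaviour splitAndTransfer should have for ListVector/UnionVector as well.
window = arr.slice(1, 3)
print(window.to_pylist())  # [2, 3, 4]
{code}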



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1192) [JAVA] Improve splitAndTransfer performance for List and Union vectors

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171971#comment-16171971
 ] 

ASF GitHub Bot commented on ARROW-1192:
---

Github user wesm commented on the issue:

https://github.com/apache/arrow/pull/819
  
@StevenMPhillips can you close? 


> [JAVA] Improve splitAndTransfer performance for List and Union vectors
> --
>
> Key: ARROW-1192
> URL: https://issues.apache.org/jira/browse/ARROW-1192
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Steven Phillips
>Assignee: Steven Phillips
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Most vector implementations slice the underlying buffer for splitAndTransfer, 
> but ListVector and UnionVector copy data into a new buffer. We should enhance 
> these to use slice as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1347) [JAVA] List null type should use consistent name for inner field

2017-09-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1347:
--
Labels: pull-request-available  (was: )

> [JAVA] List null type should use consistent name for inner field
> 
>
> Key: ARROW-1347
> URL: https://issues.apache.org/jira/browse/ARROW-1347
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Steven Phillips
>Assignee: Steven Phillips
>  Labels: pull-request-available
>
> The child field for List type has the field name "$data$" in most cases. In 
> the case that there is not a known type for the List, currently the 
> getField() method will return a subfield with name "DEFAULT". We should make 
> this consistent with the rest of the cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) [Python] Document that pip wheels depend on MSVC14 runtime

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171969#comment-16171969
 ] 

Wes McKinney commented on ARROW-1554:
-

Cool. I changed the JIRA title so that we can add a note to the Sphinx docs 
about this issue

> [Python] Document that pip wheels depend on MSVC14 runtime
> --
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1554) [Python] Document that pip wheels depend on MSVC14 runtime

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1554:

Summary: [Python] Document that pip wheels depend on MSVC14 runtime  (was: 
"ImportError: DLL load failed: The specified module could not be found" on 
Windows 10)

> [Python] Document that pip wheels depend on MSVC14 runtime
> --
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ARROW-1547) [JAVA] Fix 8x memory over-allocation in BitVector

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1547.
-
Resolution: Fixed

Issue resolved by pull request 1109
[https://github.com/apache/arrow/pull/1109]

> [JAVA] Fix 8x memory over-allocation in BitVector
> -
>
> Key: ARROW-1547
> URL: https://issues.apache.org/jira/browse/ARROW-1547
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>
> Typically there are 3 ways of specifying the amount of memory needed for 
> vectors.
> CASE (1) allocateNew() -- here the application doesn't really specify the 
> size of memory or value count. Each vector type has a default value count 
> (4096) and therefore a default size (in bytes) is used in such cases.
> For example, for a 4 byte fixed-width vector, we will allocate 32KB of memory 
> for a call to allocateNew().
> CASE (2) setInitialCapacity(count) followed by allocateNew() - In this case 
> also the application doesn't specify the value count or size in 
> allocateNew(). However, the call to setInitialCapacity() dictates the amount 
> of memory the subsequent call to allocateNew() will allocate.
> For example, we can do setInitialCapacity(1024) and the call to allocateNew() 
> will allocate 4KB of memory for the 4 byte fixed-width vector.
> CASE (3) allocateNew(count) - The application is specific about requirements.
> For nullable vectors, the above calls also allocate the memory for validity 
> vector.
> The problem is that Bit Vector uses a default memory size in bytes of 4096. 
> In other words, we allocate a vector for 4096*8 value count.
> In the default case (as explained above), the vector types have a value count 
> of 4096 so we need only 4096 bits (512 bytes) in the bit vector and not 
> really 4096 as the size in bytes.
> This happens in CASE 1 where the application depends on the default memory 
> allocation. In such cases, the buffer allocated for the bit vector is 8x 
> larger than actually needed.
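(To make the arithmetic above concrete, a small illustration; the numbers come from the description and this is not the Java fix itself.)

{code}
default_value_count = 4096                 # default number of values per vector
bytes_needed = default_value_count // 8    # one validity bit per value -> 512 bytes
bytes_allocated_before_fix = 4096          # BitVector defaulted to 4096 *bytes*

print(bytes_needed)                                 # 512
print(bytes_allocated_before_fix // bytes_needed)   # 8 -> the 8x over-allocation
{code}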



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1547) [JAVA] Fix 8x memory over-allocation in BitVector

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171965#comment-16171965
 ] 

ASF GitHub Bot commented on ARROW-1547:
---

Github user asfgit closed the pull request at:

https://github.com/apache/arrow/pull/1109


> [JAVA] Fix 8x memory over-allocation in BitVector
> -
>
> Key: ARROW-1547
> URL: https://issues.apache.org/jira/browse/ARROW-1547
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>
> Typically there are 3 ways of specifying the amount of memory needed for 
> vectors.
> CASE (1) allocateNew() -- here the application doesn't really specify the 
> size of memory or value count. Each vector type has a default value count 
> (4096) and therefore a default size (in bytes) is used in such cases.
> For example, for a 4 byte fixed-width vector, we will allocate 32KB of memory 
> for a call to allocateNew().
> CASE (2) setInitialCapacity(count) followed by allocateNew() - In this case 
> also the application doesn't specify the value count or size in 
> allocateNew(). However, the call to setInitialCapacity() dictates the amount 
> of memory the subsequent call to allocateNew() will allocate.
> For example, we can do setInitialCapacity(1024) and the call to allocateNew() 
> will allocate 4KB of memory for the 4 byte fixed-width vector.
> CASE (3) allocateNew(count) - The application is specific about requirements.
> For nullable vectors, the above calls also allocate the memory for validity 
> vector.
> The problem is that Bit Vector uses a default memory size in bytes of 4096. 
> In other words, we allocate a vector for 4096*8 value count.
> In the default case (as explained above), the vector types have a value count 
> of 4096 so we need only 4096 bits (512 bytes) in the bit vector and not 
> really 4096 as the size in bytes.
> This happens in CASE 1 where the application depends on the default memory 
> allocation. In such cases, the buffer allocated for the bit vector is 8x 
> larger than actually needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Dima Ryazanov (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171966#comment-16171966
 ] 

Dima Ryazanov commented on ARROW-1554:
--

Yep, installing the Visual Studio C++ Redistributable fixed the problem. 
(Though that answer says 2015 - but points to the 2010 one. Also, appears to be 
x86 only. I installed this one: 
https://www.microsoft.com/en-us/download/details.aspx?id=48145)

(Haven't actually tried conda yet - but I tried it before in a different 
environment, and I see "Miniconda3/Library/bin/msvcp140.dll" there - so makes 
sense that it works.)

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1547) [JAVA] Fix 8x memory over-allocation in BitVector

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171963#comment-16171963
 ] 

ASF GitHub Bot commented on ARROW-1547:
---

Github user wesm commented on the issue:

https://github.com/apache/arrow/pull/1109
  
+1


> [JAVA] Fix 8x memory over-allocation in BitVector
> -
>
> Key: ARROW-1547
> URL: https://issues.apache.org/jira/browse/ARROW-1547
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>
> Typically there are 3 ways of specifying the amount of memory needed for 
> vectors.
> CASE (1) allocateNew() -- here the application doesn't really specify the 
> size of memory or value count. Each vector type has a default value count 
> (4096) and therefore a default size (in bytes) is used in such cases.
> For example, for a 4 byte fixed-width vector, we will allocate 32KB of memory 
> for a call to allocateNew().
> CASE (2) setInitialCapacity(count) followed by allocateNew() - In this case 
> also the application doesn't specify the value count or size in 
> allocateNew(). However, the call to setInitialCapacity() dictates the amount 
> of memory the subsequent call to allocateNew() will allocate.
> For example, we can do setInitialCapacity(1024) and the call to allocateNew() 
> will allocate 4KB of memory for the 4 byte fixed-width vector.
> CASE (3) allocateNew(count) - The application is specific about requirements.
> For nullable vectors, the above calls also allocate the memory for validity 
> vector.
> The problem is that Bit Vector uses a default memory size in bytes of 4096. 
> In other words, we allocate a vector for 4096*8 value count.
> In the default case (as explained above), the vector types have a value count 
> of 4096 so we need only 4096 bits (512 bytes) in the bit vector and not 
> really 4096 as the size in bytes.
> This happens in CASE 1 where the application depends on the default memory 
> allocation. In such cases, the buffer allocated for the bit vector is 8x 
> larger than actually needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1533) [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171957#comment-16171957
 ] 

ASF GitHub Bot commented on ARROW-1533:
---

Github user asfgit closed the pull request at:

https://github.com/apache/arrow/pull/1112


> [JAVA] realloc should consider the existing buffer capacity for computing 
> target memory requirement
> ---
>
> Key: ARROW-1533
> URL: https://issues.apache.org/jira/browse/ARROW-1533
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>
> We recently encountered a problem when we were trying to add JSON files with 
> complex schema as datasets.
> Initially we started with a Float8Vector with default memory allocation of 
> (4096 * 8) 32KB.
> Went through several iterations of setSafe() to trigger a realloc() from 32KB 
> to 64KB.
> Another round of setSafe() calls to trigger a realloc() from 64KB to 128KB
> After that we encountered a BigInt and promoted our vector to UnionVector.
> This required us to create a UnionVector with BigIntVector and Float8Vector. 
> The latter required us to transfer the Float8Vector we were earlier working 
> with to the Float8Vector inside the Union.
> As part of transferTo(), the target Float8Vector got all the ArrowBuf state 
> (capacity, buffer contents) etc transferred from the source vector.
> Later, a realloc was triggered on the Float8Vector inside the UnionVector.
> The computation inside realloc() to determine the amount of memory to be 
> reallocated goes wrong since it makes the decision based on 
> allocateSizeInBytes -- although this vector was created as part of transfer() 
> from 128KB source vector, allocateSizeInBytes is still at the initial/default 
> value of 32KB
> We end up allocating a 64KB buffer and attempt to copy 128KB over 64KB and 
> seg fault when invoking setBytes().
> There is a wrong assumption in realloc() that allocateSizeInBytes is always 
> equal to data.capacity(). The particular scenario described above exposes 
> where this assumption could go wrong.
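(Purely to illustrate the failure mode described above, a small Python model of the sizing decision, not the actual Java code: when the new size is derived from a stale cached field instead of the buffer's real capacity, a vector that received a larger buffer via transfer ends up under-allocating.)

{code}
def realloc_sizes(cached_allocation_size, actual_capacity):
    # Buggy: double a cached field that transferTo() never updated.
    buggy_new_size = cached_allocation_size * 2
    # Fixed: double the capacity the buffer actually has.
    fixed_new_size = actual_capacity * 2
    return buggy_new_size, fixed_new_size

# Scenario from the description: a 128KB buffer arrived via transfer, but the
# cached allocation size is still the initial 32KB.
buggy, fixed = realloc_sizes(cached_allocation_size=32 * 1024,
                             actual_capacity=128 * 1024)
print(buggy < 128 * 1024)   # True: a 64KB target cannot hold the existing 128KB
print(fixed)                # 262144 bytes (256KB), enough for the copy
{code}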



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ARROW-1533) [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1533.
-
Resolution: Fixed

Issue resolved by pull request 1112
[https://github.com/apache/arrow/pull/1112]

> [JAVA] realloc should consider the existing buffer capacity for computing 
> target memory requirement
> ---
>
> Key: ARROW-1533
> URL: https://issues.apache.org/jira/browse/ARROW-1533
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>
> We recently encountered a problem when we were trying to add JSON files with 
> complex schema as datasets.
> Initially we started with a Float8Vector with default memory allocation of 
> (4096 * 8) 32KB.
> Went through several iterations of setSafe() to trigger a realloc() from 32KB 
> to 64KB.
> Another round of setSafe() calls to trigger a realloc() from 64KB to 128KB
> After that we encountered a BigInt and promoted our vector to UnionVector.
> This required us to create a UnionVector with BigIntVector and Float8Vector. 
> The latter required us to transfer the Float8Vector we were earlier working 
> with to the Float8Vector inside the Union.
> As part of transferTo(), the target Float8Vector got all the ArrowBuf state 
> (capacity, buffer contents) etc transferred from the source vector.
> Later, a realloc was triggered on the Float8Vector inside the UnionVector.
> The computation inside realloc() to determine the amount of memory to be 
> reallocated goes wrong since it makes the decision based on 
> allocateSizeInBytes -- although this vector was created as part of transfer() 
> from 128KB source vector, allocateSizeInBytes is still at the initial/default 
> value of 32KB
> We end up allocating a 64KB buffer and attempt to copy 128KB over 64KB and 
> seg fault when invoking setBytes().
> There is a wrong assumption in realloc() that allocateSizeInBytes is always 
> equal to data.capacity(). The particular scenario described above exposes 
> where this assumption could go wrong.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1533) [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171955#comment-16171955
 ] 

ASF GitHub Bot commented on ARROW-1533:
---

Github user wesm commented on the issue:

https://github.com/apache/arrow/pull/1112
  
+1


> [JAVA] realloc should consider the existing buffer capacity for computing 
> target memory requirement
> ---
>
> Key: ARROW-1533
> URL: https://issues.apache.org/jira/browse/ARROW-1533
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>
> We recently encountered a problem when we were trying to add JSON files with 
> complex schema as datasets.
> Initially we started with a Float8Vector with default memory allocation of 
> (4096 * 8) 32KB.
> Went through several iterations of setSafe() to trigger a realloc() from 32KB 
> to 64KB.
> Another round of setSafe() calls to trigger a realloc() from 64KB to 128KB
> After that we encountered a BigInt and promoted our vector to UnionVector.
> This required us to create a UnionVector with BigIntVector and Float8Vector. 
> The latter required us to transfer the Float8Vector we were earlier working 
> with to the Float8Vector inside the Union.
> As part of transferTo(), the target Float8Vector got all the ArrowBuf state 
> (capacity, buffer contents) etc transferred from the source vector.
> Later, a realloc was triggered on the Float8Vector inside the UnionVector.
> The computation inside realloc() to determine the amount of memory to be 
> reallocated goes wrong since it makes the decision based on 
> allocateSizeInBytes -- although this vector was created as part of transfer() 
> from 128KB source vector, allocateSizeInBytes is still at the initial/default 
> value of 32KB
> We end up allocating a 64KB buffer and attempt to copy 128KB over 64KB and 
> seg fault when invoking setBytes().
> There is a wrong assumption in realloc() that allocateSizeInBytes is always 
> equal to data.capacity(). The particular scenario described above exposes 
> where this assumption could go wrong.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171922#comment-16171922
 ] 

Wes McKinney commented on ARROW-1554:
-

According to 
https://answers.microsoft.com/en-us/windows/forum/windows_10-performance/msvcp140dll-is-missing-in-my-win-10/1c65d6b0-68b8-4b59-b720-3e6a33038389?auth=1
 you may be able to resolve this by installing Visual C++ Redistributable on 
your machine, which will install the VC14 runtime. 

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1554:

Attachment: parquet_dependencies.png

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171921#comment-16171921
 ] 

Wes McKinney commented on ARROW-1554:
-

OK, yeah, I used dependency walker and see that also, attaching screenshot

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
> Attachments: parquet_dependencies.png, Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Dima Ryazanov (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171913#comment-16171913
 ] 

Dima Ryazanov commented on ARROW-1554:
--

Looks like it's missing MSVCP140.dll - see the screenshot.

And you're right, tensorflow is also failing.

I'll try conda next.

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
> Attachments: Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Dima Ryazanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dima Ryazanov updated ARROW-1554:
-
Attachment: Process Monitor.png

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
> Attachments: Process Monitor.png
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171905#comment-16171905
 ] 

Wes McKinney commented on ARROW-1554:
-

Are you able to install and use tensorflow from pip on your machine? 
https://pypi.python.org/pypi/tensorflow 

That's a very similar build toolchain to ours

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171907#comment-16171907
 ] 

Wes McKinney commented on ARROW-1554:
-

cc [~Max Risuhin]

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171897#comment-16171897
 ] 

Wes McKinney commented on ARROW-1554:
-

I see. Are you able to install with conda instead? That's going to be a much 
more reliable and robust environment all around for Windows users. It also 
installs dependencies such as the different MSVC runtimes.

If you or anyone else knows of a tool to figure out which DLL dependency is 
missing (based on what we've discussed, it suggests the missing piece is 
_outside_ pyarrow, such as something in the MSVC runtime), that would be really 
helpful. 
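
A minimal sketch of such a probe (not from the original thread): it assumes a 
Windows machine with pyarrow installed under site-packages and simply tries to 
load each DLL that ships with pyarrow; Dependency Walker or `dumpbin 
/dependents` are the usual tools for the same job.

{code}
# Hypothetical probe script (not part of pyarrow): try to load every DLL
# shipped with pyarrow and report which ones fail, to narrow down the
# missing transitive dependency (e.g. an MSVC runtime DLL).
import ctypes
import pathlib
import sysconfig

pyarrow_dir = pathlib.Path(sysconfig.get_paths()["purelib"]) / "pyarrow"

for dll in sorted(pyarrow_dir.glob("*.dll")):
    try:
        ctypes.WinDLL(str(dll))
        print("OK     ", dll.name)
    except OSError as exc:
        # WinError 126 here usually means a dependency *of this DLL* is
        # missing, not the DLL itself.
        print("FAILED ", dll.name, "-", exc)
{code}

Whichever DLL fails to load first is the place to start looking for the 
missing runtime dependency.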

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Dima Ryazanov (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171874#comment-16171874
 ] 

Dima Ryazanov edited comment on ARROW-1554 at 9/19/17 3:17 PM:
---

Yes, using pip. I've tried 0.5.0, 0.6.0, and 0.7.0 - and it's all the same.

I just did a "pip uninstall pyarrow"; it failed cause I actually had some files 
open, so I then manually deleted the ...\site-packages\pyarrow dir, then 
installed pyarrow again. Same thing.


was (Author: dimaryaz):
Yes, using pip. I've tried 0.5.0, 0.6.0, and 0.7.0 - and it's all the same.

I just did a {code}pip uninstall pyarrow{code}; it failed because I actually 
had some files open, so I then manually deleted the ...\site-packages\pyarrow 
dir, then installed pyarrow again. Same thing.

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1533) [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171876#comment-16171876
 ] 

ASF GitHub Bot commented on ARROW-1533:
---

Github user icexelloss commented on the issue:

https://github.com/apache/arrow/pull/1112
  
LGTM too.


> [JAVA] realloc should consider the existing buffer capacity for computing 
> target memory requirement
> ---
>
> Key: ARROW-1533
> URL: https://issues.apache.org/jira/browse/ARROW-1533
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>
> We recently encountered a problem while trying to add JSON files with a 
> complex schema as datasets.
> Initially we started with a Float8Vector with the default memory allocation 
> of (4096 * 8) = 32KB.
> Several iterations of setSafe() triggered a realloc() from 32KB to 64KB.
> Another round of setSafe() calls triggered a realloc() from 64KB to 128KB.
> After that we encountered a BigInt and promoted our vector to UnionVector.
> This required us to create a UnionVector with a BigIntVector and a 
> Float8Vector. The latter required us to transfer the Float8Vector we had been 
> working with to the Float8Vector inside the Union.
> As part of transferTo(), the target Float8Vector got all the ArrowBuf state 
> (capacity, buffer contents, etc.) transferred from the source vector.
> Later, a realloc was triggered on the Float8Vector inside the UnionVector.
> The computation inside realloc() that determines how much memory to 
> reallocate goes wrong because it bases the decision on allocateSizeInBytes -- 
> although this vector was created via transfer() from a 128KB source vector, 
> allocateSizeInBytes is still at the initial/default value of 32KB.
> We end up allocating a 64KB buffer, attempting to copy 128KB into it, and 
> seg faulting when invoking setBytes().
> realloc() wrongly assumes that allocateSizeInBytes is always equal to 
> data.capacity(); the scenario described above shows where that assumption 
> breaks down.
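
The size arithmetic is easier to see in a toy model; below is a minimal Python 
sketch (hypothetical names, not the actual Java vector code) of how a realloc 
size derived from a stale bookkeeping field undershoots the live buffer after 
a transfer.

{code}
KB = 1024

class ToyVector:
    """Toy model: realloc sizes the new buffer from a bookkeeping field,
    while the copy that follows is driven by the real buffer capacity."""
    def __init__(self):
        self.allocation_size = 32 * KB   # default bookkeeping value (4096 * 8)
        self.capacity = 32 * KB          # actual buffer size

    def realloc(self):
        new_size = self.allocation_size * 2      # decision uses the stale field
        if new_size < self.capacity:             # copy would use real capacity
            raise MemoryError("copying %d bytes into a %d-byte buffer"
                              % (self.capacity, new_size))
        self.allocation_size = self.capacity = new_size

src = ToyVector()
src.realloc()                    # 32KB -> 64KB
src.realloc()                    # 64KB -> 128KB

dst = ToyVector()                # transferTo(): dst receives the 128KB buffer
dst.capacity = src.capacity      # ... but keeps the default 32KB bookkeeping value
try:
    dst.realloc()                # would copy 128KB into a fresh 64KB buffer
except MemoryError as exc:
    print("overflow:", exc)
{code}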



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Dima Ryazanov (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171874#comment-16171874
 ] 

Dima Ryazanov commented on ARROW-1554:
--

Yes, using pip. I've tried 0.5.0, 0.6.0, and 0.7.0 - and it's all the same.

I just did a {code}pip uninstall pyarrow{code}; it failed because I actually 
had some files open, so I then manually deleted the ...\site-packages\pyarrow 
dir, then installed pyarrow again. Same thing.

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1553) [JAVA] Implement setInitialCapacity for MapWriter and pass on this capacity during lazy creation of child vectors

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171816#comment-16171816
 ] 

ASF GitHub Bot commented on ARROW-1553:
---

Github user jacques-n commented on the issue:

https://github.com/apache/arrow/pull/1113
  
LGTM. Definitely helps our use case. Agree with @icexelloss that we should 
add a test as well for this situation.


> [JAVA] Implement setInitialCapacity for MapWriter and pass on this capacity 
> during lazy creation of child vectors
> -
>
> Key: ARROW-1553
> URL: https://issues.apache.org/jira/browse/ARROW-1553
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1533) [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171812#comment-16171812
 ] 

ASF GitHub Bot commented on ARROW-1533:
---

Github user jacques-n commented on the issue:

https://github.com/apache/arrow/pull/1112
  
Good additional questions that we should address in ARROW-1463. +1 on 
getting this merged.


> [JAVA] realloc should consider the existing buffer capacity for computing 
> target memory requirement
> ---
>
> Key: ARROW-1533
> URL: https://issues.apache.org/jira/browse/ARROW-1533
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>
> We recently encountered a problem while trying to add JSON files with a 
> complex schema as datasets.
> Initially we started with a Float8Vector with the default memory allocation 
> of (4096 * 8) = 32KB.
> Several iterations of setSafe() triggered a realloc() from 32KB to 64KB.
> Another round of setSafe() calls triggered a realloc() from 64KB to 128KB.
> After that we encountered a BigInt and promoted our vector to UnionVector.
> This required us to create a UnionVector with a BigIntVector and a 
> Float8Vector. The latter required us to transfer the Float8Vector we had been 
> working with to the Float8Vector inside the Union.
> As part of transferTo(), the target Float8Vector got all the ArrowBuf state 
> (capacity, buffer contents, etc.) transferred from the source vector.
> Later, a realloc was triggered on the Float8Vector inside the UnionVector.
> The computation inside realloc() that determines how much memory to 
> reallocate goes wrong because it bases the decision on allocateSizeInBytes -- 
> although this vector was created via transfer() from a 128KB source vector, 
> allocateSizeInBytes is still at the initial/default value of 32KB.
> We end up allocating a 64KB buffer, attempting to copy 128KB into it, and 
> seg faulting when invoking setBytes().
> realloc() wrongly assumes that allocateSizeInBytes is always equal to 
> data.capacity(); the scenario described above shows where that assumption 
> breaks down.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1538) [C++] Support Ubuntu 14.04 in .deb packaging automation

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171762#comment-16171762
 ] 

Wes McKinney commented on ARROW-1538:
-

Hi [~rvernica], you need ARROW-1546: 
https://github.com/apache/arrow/commit/bfe657909f5e7d96b7b8e5179baa17044b6ea375

> [C++] Support Ubuntu 14.04 in .deb packaging automation
> ---
>
> Key: ARROW-1538
> URL: https://issues.apache.org/jira/browse/ARROW-1538
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Packaging
>Reporter: Wes McKinney
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1209) [C++] Implement converter between Arrow record batches and Avro records

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171759#comment-16171759
 ] 

ASF GitHub Bot commented on ARROW-1209:
---

Github user wesm commented on the issue:

https://github.com/apache/arrow/pull/1026
  
Wow, providing a `FILE*`! That is incredibly restrictive. I will have to 
poke around in the C implementation and also look at other Avro users like 
Impala: 
https://github.com/apache/incubator-impala/blob/master/be/src/exec/hdfs-avro-scanner.cc


> [C++] Implement converter between Arrow record batches and Avro records
> ---
>
> Key: ARROW-1209
> URL: https://issues.apache.org/jira/browse/ARROW-1209
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: pull-request-available
>
> This would be useful for streaming systems that need to consume or produce 
> Avro in C/C++



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1555) PyArrow write_to_dataset on s3

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171755#comment-16171755
 ] 

Wes McKinney commented on ARROW-1555:
-

cc [~fjetter]

This may not be too hard to fix -- I don't think that 
{{parquet.write_to_dataset}} has been tested with S3, so a patch to make this 
S3-friendly would be welcome. 
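
For reference, a short sketch of the call path that hits this (bucket name is 
hypothetical; assumes s3fs is installed and AWS credentials are configured), 
mirroring the reproduction in the description below:

{code}
# Sketch of the failing call path (bucket name is hypothetical).
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()
table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=["x"])

# write_to_dataset wraps the s3fs filesystem via _ensure_filesystem(); the
# resulting S3FSWrapper does not expose exists(), which is where the
# exception reported below is raised.
pq.write_to_dataset(table, "my-bucket/dataset_root", filesystem=s3)
{code}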

> PyArrow write_to_dataset on s3
> --
>
> Key: ARROW-1555
> URL: https://issues.apache.org/jira/browse/ARROW-1555
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Young-Jun Ko
>Priority: Trivial
> Fix For: 0.8.0
>
>
> When writing an Arrow table to s3, I get a NotImplemented exception.
> The root cause is in _ensure_filesystem and can be reproduced as follows:
> import pyarrow
> import pyarrow.parquet as pqa
> import s3fs
> s3 = s3fs.S3FileSystem()
> pqa._ensure_filesystem(s3).exists("anything")
> It appears that the S3FSWrapper that is instantiated in _ensure_filesystem 
> does not expose the exists method of s3.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1555) PyArrow write_to_dataset on s3

2017-09-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1555:

Fix Version/s: 0.8.0

> PyArrow write_to_dataset on s3
> --
>
> Key: ARROW-1555
> URL: https://issues.apache.org/jira/browse/ARROW-1555
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Young-Jun Ko
>Priority: Trivial
> Fix For: 0.8.0
>
>
> When writing an Arrow table to s3, I get a NotImplemented exception.
> The root cause is in _ensure_filesystem and can be reproduced as follows:
> import pyarrow
> import pyarrow.parquet as pqa
> import s3fs
> s3 = s3fs.S3FileSystem()
> pqa._ensure_filesystem(s3).exists("anything")
> It appears that the S3FSWrapper that is instantiated in _ensure_filesystem 
> does not expose the exists method of s3.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171680#comment-16171680
 ] 

Wes McKinney commented on ARROW-1554:
-

I just tested the 0.7.0 wheel locally on Windows 10 and it works OK for me. Is 
it possible that you had one of the DLLs open when you updated pyarrow? Maybe 
try removing the directory and reinstalling

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (ARROW-1554) "ImportError: DLL load failed: The specified module could not be found" on Windows 10

2017-09-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171665#comment-16171665
 ] 

Wes McKinney edited comment on ARROW-1554 at 9/19/17 1:31 PM:
--

You installed the wheel with pip, is that right? Is it pyarrow 0.7.0? 


was (Author: wesmckinn):
You installed the wheel with pip, is that right? Is it pyarrow 0.6.0? 

> "ImportError: DLL load failed: The specified module could not be found" on 
> Windows 10
> -
>
> Key: ARROW-1554
> URL: https://issues.apache.org/jira/browse/ARROW-1554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Windows 10 (x64)
> Python 3.6.2 (x64)
>Reporter: Dima Ryazanov
> Fix For: 0.8.0
>
>
> I just tried pyarrow on Windows 10, and it fails to import for me:
> {code}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\site-packages\pyarrow\__init__.py", 
> line 32, in 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified module could not be found.
> {code}
> Not sure which DLL is failing, but I do see some DLLs in the pyarrow folder:
> {code}
> C:\Users\dima\Documents>dir "C:\Program 
> Files\Python36\lib\site-packages\pyarrow\"
>  Volume in drive C has no label.
>  Volume Serial Number is 4CE9-CC3C
>  Directory of C:\Program Files\Python36\lib\site-packages\pyarrow
> 09/19/2017  01:14 AM    <DIR>          .
> 09/19/2017  01:14 AM    <DIR>          ..
> 09/19/2017  01:14 AM 2,382,336 arrow.dll
> 09/19/2017  01:14 AM   604,160 arrow_python.dll
> 09/19/2017  01:14 AM 3,402 compat.py
> ...
> {code}
> However, I cannot open them using ctypes.cdll. I wonder if some dependency is 
> missing?
> {code}
> >>> open('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll', 'rb')
> <_io.BufferedReader name='C:\\Program 
> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll'>
> >>>
> >>> cdll.LoadLibrary('C:\\Program 
> >>> Files\\Python36\\Lib\\site-packages\\pyarrow\\parquet.dll')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 426, in 
> LoadLibrary
> return self._dlltype(name)
>   File "C:\Program Files\Python36\lib\ctypes\__init__.py", line 348, in 
> __init__
> self._handle = _dlopen(self._name, mode)
> OSError: [WinError 126] The specified module could not be found
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

