[jira] [Assigned] (ARROW-9688) [C++] Supporting Windows ARM64 builds

2021-10-11 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-9688:
---

Assignee: Niyas

> [C++] Supporting Windows ARM64 builds
> -
>
> Key: ARROW-9688
> URL: https://issues.apache.org/jira/browse/ARROW-9688
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
> Environment: Windows
>Reporter: Mukul Sabharwal
>Assignee: Niyas
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 336h
>  Time Spent: 1h
>  Remaining Estimate: 335h
>
> I was trying to build the Arrow library so I could use it to generate Parquet
> files on Windows ARM64, but it currently fails to compile for a few reasons.
> I thought I'd enumerate them here so someone more familiar with the project
> could spearhead it.
> In SetupCxxFlags.cmake:
>  * The MSVC branch for ARROW_CPU_FLAG STREQUAL "x86" is taken even though I'm
> building for ARM64. This may be a more fundamental error somewhere else that
> needs correction, and maybe things would work better once fixed, but an
> inspection of the other branches seemed to indicate that ARM64 is assumed to
> be missing from MSVC, and the keyword "aarch64" (not a term used in the
> Windows ecosystem) is prevalent in the CMake files. So the first thing I did
> was stub it out and mark SSE4.2, AVX, and AVX512 as not present.
>  * In bit_util.h I provided implementations of popcount32 and popcount64 that
> are not NEON-accelerated, although neon_cnt is provided by MSVC (for n64).
>  * Removed nmmintrin.h since it is x86/x64-specific. Note that _BitScanReverse
> and _BitScanForward are Microsoft-specific and supported on ARM64.
>  * cpu_info.cc needed tweaks for the cpuid code; I just returned false and
> did not worry much about upstream effects. flag_mappings and num_flags ought
> to be defined in the non-WIN32 ifdef, since they are not actually used.
> After these changes I was able to remove the vcpkg restriction that
> artificially prevented the library from compiling on arm64, and I was able to
> successfully compile for both arm64-windows-static and arm64-windows.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9688) [C++] Supporting Windows ARM64 builds

2021-10-11 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-9688:

Summary: [C++] Supporting Windows ARM64 builds  (was: Supporting Windows 
ARM64 builds)

> [C++] Supporting Windows ARM64 builds
> -
>
> Key: ARROW-9688
> URL: https://issues.apache.org/jira/browse/ARROW-9688
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
> Environment: Windows
>Reporter: Mukul Sabharwal
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 336h
>  Time Spent: 50m
>  Remaining Estimate: 335h 10m
>
> I was trying to build the Arrow library so I could use it to generate Parquet
> files on Windows ARM64, but it currently fails to compile for a few reasons.
> I thought I'd enumerate them here so someone more familiar with the project
> could spearhead it.
> In SetupCxxFlags.cmake:
>  * The MSVC branch for ARROW_CPU_FLAG STREQUAL "x86" is taken even though I'm
> building for ARM64. This may be a more fundamental error somewhere else that
> needs correction, and maybe things would work better once fixed, but an
> inspection of the other branches seemed to indicate that ARM64 is assumed to
> be missing from MSVC, and the keyword "aarch64" (not a term used in the
> Windows ecosystem) is prevalent in the CMake files. So the first thing I did
> was stub it out and mark SSE4.2, AVX, and AVX512 as not present.
>  * In bit_util.h I provided implementations of popcount32 and popcount64 that
> are not NEON-accelerated, although neon_cnt is provided by MSVC (for n64).
>  * Removed nmmintrin.h since it is x86/x64-specific. Note that _BitScanReverse
> and _BitScanForward are Microsoft-specific and supported on ARM64.
>  * cpu_info.cc needed tweaks for the cpuid code; I just returned false and
> did not worry much about upstream effects. flag_mappings and num_flags ought
> to be defined in the non-WIN32 ifdef, since they are not actually used.
> After these changes I was able to remove the vcpkg restriction that
> artificially prevented the library from compiling on arm64, and I was able to
> successfully compile for both arm64-windows-static and arm64-windows.
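The first bullet above is essentially an architecture-detection problem: on Windows ARM64, CMAKE_SYSTEM_PROCESSOR reports "ARM64", while Linux toolchains report "aarch64", so detection that only matches "aarch64" falls through to the x86 branch. A hedged sketch of a normalization fix (the variable name ARROW_CPU_FLAG is from Arrow's SetupCxxFlags.cmake; the matching logic here is illustrative, not the project's actual code):

```cmake
# Normalize the processor name: MSVC/Windows reports "ARM64",
# Linux toolchains report "aarch64".
string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" _arrow_proc)
if(_arrow_proc MATCHES "^(aarch64|arm64)$")
  set(ARROW_CPU_FLAG "armv8")
elseif(_arrow_proc MATCHES "^(x86|amd64|x86_64)")
  set(ARROW_CPU_FLAG "x86")
endif()
```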





[jira] [Commented] (ARROW-14291) Use CI for linting

2021-10-11 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427443#comment-17427443
 ] 

Kouhei Sutou commented on ARROW-14291:
--

Lint is done by 
https://github.com/apache/arrow/blob/master/.github/workflows/dev.yml#L35 .

> Use CI for linting
> --
>
> Key: ARROW-14291
> URL: https://issues.apache.org/jira/browse/ARROW-14291
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
>
> Currently the development process requires the developer to lint their code 
> before committing it. This can be inefficient for changes made in the browser 
> and when one has a different compiler setup than that used for linting. As 
> described in 
> [https://dev.to/flipp-engineering/linting-only-changed-files-with-github-actions-4ddp]
>  development efficiency can be improved if altered code is automatically 
> linted by a CI action.
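A minimal sketch of the kind of workflow the linked post describes, linting only the files a pull request touched (the file path, tool choice, and glob patterns here are illustrative assumptions, not Arrow's actual configuration):

```yaml
# .github/workflows/lint-changed.yml (hypothetical)
name: Lint changed files
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0  # full history so the base branch is reachable
      - name: Lint only C++ files changed in this PR
        run: |
          git diff --name-only "origin/${{ github.base_ref }}..." -- '*.cc' '*.h' \
            | xargs --no-run-if-empty clang-format --dry-run --Werror
```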





[jira] [Comment Edited] (ARROW-14291) Use CI for linting

2021-10-11 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427442#comment-17427442
 ] 

Kouhei Sutou edited comment on ARROW-14291 at 10/12/21, 4:34 AM:
-

We already have a lint-fix-by-CI feature: 
https://github.com/apache/arrow/blob/master/.github/workflows/comment_bot.yml#L52

Our C++ lint targets are only {{cpp/src/}}: 
https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L243

They don't include {{cpp/examples}}.


was (Author: kou):
We already have this feature: 
https://github.com/apache/arrow/blob/master/.github/workflows/comment_bot.yml#L52

Our C++ lint targets are only {{cpp/src/}}: 
https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L243

They don't include {{cpp/examples}}.

> Use CI for linting
> --
>
> Key: ARROW-14291
> URL: https://issues.apache.org/jira/browse/ARROW-14291
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
>
> Currently the development process requires the developer to lint their code 
> before committing it. This can be inefficient for changes made in the browser 
> and when one has a different compiler setup than that used for linting. As 
> described in 
> [https://dev.to/flipp-engineering/linting-only-changed-files-with-github-actions-4ddp]
>  development efficiency can be improved if altered code is automatically 
> linted by a CI action.





[jira] [Commented] (ARROW-14291) Use CI for linting

2021-10-11 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427442#comment-17427442
 ] 

Kouhei Sutou commented on ARROW-14291:
--

We already have this feature: 
https://github.com/apache/arrow/blob/master/.github/workflows/comment_bot.yml#L52

Our C++ lint targets are only {{cpp/src/}}: 
https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L243

They don't include {{cpp/examples}}.

> Use CI for linting
> --
>
> Key: ARROW-14291
> URL: https://issues.apache.org/jira/browse/ARROW-14291
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
>
> Currently the development process requires the developer to lint their code 
> before committing it. This can be inefficient for changes made in the browser 
> and when one has a different compiler setup than that used for linting. As 
> described in 
> [https://dev.to/flipp-engineering/linting-only-changed-files-with-github-actions-4ddp]
>  development efficiency can be improved if altered code is automatically 
> linted by a CI action.





[jira] [Created] (ARROW-14291) Use CI for linting

2021-10-11 Thread Benson Muite (Jira)
Benson Muite created ARROW-14291:


 Summary: Use CI for linting
 Key: ARROW-14291
 URL: https://issues.apache.org/jira/browse/ARROW-14291
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benson Muite
Assignee: Benson Muite


Currently the development process requires the developer to lint their code 
before committing it. This can be inefficient for changes made in the browser 
and when one has a different compiler setup than that used for linting. As 
described in 
[https://dev.to/flipp-engineering/linting-only-changed-files-with-github-actions-4ddp]
 development efficiency can be improved if altered code is automatically linted 
by a CI action.





[jira] [Resolved] (ARROW-14269) [C++] Consolidate utf8 benchmark

2021-10-11 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai resolved ARROW-14269.
--
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11376
[https://github.com/apache/arrow/pull/11376]

> [C++] Consolidate utf8 benchmark
> 
>
> Key: ARROW-14269
> URL: https://issues.apache.org/jira/browse/ARROW-14269
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I found that some trivial (and obviously irrelevant) changes to the UTF8
> validation code can cause large variance in benchmark results.
> The UTF8 validation functions are inlined and called directly in the
> benchmark, so the compiler may optimize them together with the benchmark loop.
> Un-inlining the benchmarked functions makes the results predictable and
> explainable.





[jira] [Commented] (ARROW-14290) [C++] String comparison in between ternary kernel

2021-10-11 Thread Benson Muite (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427434#comment-17427434
 ] 

Benson Muite commented on ARROW-14290:
--

Yes, the initial implementation does not allow for comparison of strings with 
keys, which is important for many applications. As you pointed out, a related 
issue for sorting is https://issues.apache.org/jira/browse/ARROW-12046

> [C++] String comparison in between ternary kernel
> -
>
> Key: ARROW-14290
> URL: https://issues.apache.org/jira/browse/ARROW-14290
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
>
> String comparisons in C++ order strings by Unicode code point. This may not
> be suitable for many language applications, for example when using characters
> from languages that extend beyond ASCII. Sorting algorithms can often accept
> custom comparison functions; it would be helpful to allow this for the
> between kernel as well. Initial work on the between kernel is being tracked
> in https://issues.apache.org/jira/browse/ARROW-9843





[jira] [Commented] (ARROW-14260) [C++] GTest linker error with vcpkg and Visual Studio 2019

2021-10-11 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427347#comment-17427347
 ] 

Kouhei Sutou commented on ARROW-14260:
--

Can we see the build command line for the link failure?

> [C++] GTest linker error with vcpkg and Visual Studio 2019
> --
>
> Key: ARROW-14260
> URL: https://issues.apache.org/jira/browse/ARROW-14260
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ian Cook
>Priority: Major
>
> The *test-build-vcpkg-win* nightly Crossbow job is failing with these linker 
> errors:
> {code:java}
>  unity_1_cxx.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) void __cdecl 
> testing::internal2::PrintBytesInObjectTo(unsigned char const *,unsigned 
> __int64,class std::basic_ostream > *)" 
> (__imp_?PrintBytesInObjectTo@internal2@testing@@YAXPEBE_KPEAV?$basic_ostream@DU?$char_traits@D@std@@@std@@@Z)
>  referenced in function "class std::basic_ostream std::char_traits > & __cdecl testing::internal2::operator<< std::char_traits,class std::_Vector_iterator std::_Vector_val 
> > > >(class std::basic_ostream > &,class 
> std::_Vector_iterator arrow::compute::ExecNode *> > > const &)" 
> (??$?6DU?$char_traits@D@std@@V?$_Vector_iterator@V?$_Vector_val@U?$_Simple_types@PEAVExecNode@compute@arrow@@@std@@@std@@@1@@internal2@testing@@YAAEAV?$basic_ostream@DU?$char_traits@D@std@@@std@@AEAV23@AEBV?$_Vector_iterator@V?$_Vector_val@U?$_Simple_types@PEAVExecNode@compute@arrow@@@std@@@std@@@3@@Z)
>  
> unity_1_cxx.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) class testing::AssertionResult __cdecl 
> testing::internal::CmpHelperEQ(char const *,char const *,__int64,__int64)" 
> (__imp_?CmpHelperEQ@internal@testing@@YA?AVAssertionResult@2@PEBD0_J1@Z) 
> referenced in function "void __cdecl arrow::fs::AssertFileContents(class 
> arrow::fs::FileSystem *,class std::basic_string std::char_traits,class std::allocator > const &,class 
> std::basic_string,class 
> std::allocator > const &)" 
> (?AssertFileContents@fs@arrow@@YAXPEAVFileSystem@12@AEBV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@1@Z)
>  
> unity_0_cxx.obj : error LNK2001: unresolved external symbol 
> "__declspec(dllimport) class testing::AssertionResult __cdecl 
> testing::internal::CmpHelperEQ(char const *,char const *,__int64,__int64)" 
> (__imp_?CmpHelperEQ@internal@testing@@YA?AVAssertionResult@2@PEBD0_J1@Z) 
> {code}
> Link to the error where it occurs in the full log: 
> https://github.com/ursacomputing/crossbow/runs/3799925986#step:4:2737





[jira] [Commented] (ARROW-14290) [C++] String comparison in between ternary kernel

2021-10-11 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427318#comment-17427318
 ] 

Antoine Pitrou commented on ARROW-14290:


Is this different from ARROW-12046 ?

> [C++] String comparison in between ternary kernel
> -
>
> Key: ARROW-14290
> URL: https://issues.apache.org/jira/browse/ARROW-14290
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
>
> String comparisons in C++ order strings by Unicode code point. This may not
> be suitable for many language applications, for example when using characters
> from languages that extend beyond ASCII. Sorting algorithms can often accept
> custom comparison functions; it would be helpful to allow this for the
> between kernel as well. Initial work on the between kernel is being tracked
> in https://issues.apache.org/jira/browse/ARROW-9843





[jira] [Updated] (ARROW-14290) [C++] String comparison in between ternary kernel

2021-10-11 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-14290:
---
Component/s: C++

> [C++] String comparison in between ternary kernel
> -
>
> Key: ARROW-14290
> URL: https://issues.apache.org/jira/browse/ARROW-14290
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
>
> String comparisons in C++ order strings by Unicode code point. This may not
> be suitable for many language applications, for example when using characters
> from languages that extend beyond ASCII. Sorting algorithms can often accept
> custom comparison functions; it would be helpful to allow this for the
> between kernel as well. Initial work on the between kernel is being tracked
> in https://issues.apache.org/jira/browse/ARROW-9843





[jira] [Resolved] (ARROW-14252) [R] Partial matching of arguments warning

2021-10-11 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-14252.
-
Fix Version/s: (was: 7.0.0)
   6.0.0
   Resolution: Fixed

Issue resolved by pull request 11371
[https://github.com/apache/arrow/pull/11371]

> [R] Partial matching of arguments warning
> -
>
> Key: ARROW-14252
> URL: https://issues.apache.org/jira/browse/ARROW-14252
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Nicola Crane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> There are a few examples of partially matched arguments in the code.  One 
> example is below, but there could be others.
> {code:r}
> Failure (test-dplyr-query.R:46:3): dim() on query
> `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl` threw an unexpected warning.
> Message: partial match of 'filtered' to 'filtered_rows'
> Class:   simpleWarning/warning/condition
> Backtrace:
>   1. arrow:::expect_dplyr_equal(...) test-dplyr-query.R:46:2
>  11. arrow::dim.arrow_dplyr_query(.)
>  12. base::isTRUE(x$filtered) /Users/dragos/Documents/arrow/r/R/dplyr.R:147:2
> Failure (test-dplyr-query.R:46:3): dim() on query
> `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> Table$create(tbl` threw an unexpected warning.
> Message: partial match of 'filtered' to 'filtered_rows'
> Class:   simpleWarning/warning/condition
> Backtrace:
>   1. arrow:::expect_dplyr_equal(...) test-dplyr-query.R:46:2
>  11. arrow::dim.arrow_dplyr_query(.)
>  12. base::isTRUE(x$filtered) /Users/dragos/Documents/arrow/r/R/dplyr.R:147:2
> {code}
> This is the relevant line of code in the example above: 
> https://github.com/apache/arrow/blob/25a6f591d1f162106b74e29870ebd4012e9874cc/r/R/dplyr.R#L150





[jira] [Updated] (ARROW-14074) [C++][Compute] Sketch a C++ consumer of compute IR

2021-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14074:
---
Labels: pull-request-available query-engine  (was: query-engine)

> [C++][Compute] Sketch a C++ consumer of compute IR
> --
>
> Key: ARROW-14074
> URL: https://issues.apache.org/jira/browse/ARROW-14074
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Compute IR
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 7.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-14062 adds a basic compute Intermediate Representation. Allowing c++ 
> compute to consume this and produce ExecPlans will allow more straightforward 
> and less tightly coupled usage of ExecPlans from bindings.





[jira] [Updated] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files

2021-10-11 Thread Rares Vernica (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rares Vernica updated ARROW-14161:
--
Description: 
Missing documentation for the Reading/Writing Parquet files C++ API:
 * 
[WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
 missing docs on chunk_size found some 
[here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
 _size of the RowGroup in the parquet file. Normally you would choose this to 
be rather large_
 * Typo in file reader 
[example|https://arrow.apache.org/docs/cpp/parquet.html#filereader] the include 
should be {{#include "parquet/arrow/reader.h"}}
 * 
[WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE]
 missing docs on {{compression}}
 * Missing example on using WriteProperties

  was:
Missing documentation on Reading/Writing Parquet files C++ api:
 * 
[WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
 missing docs on chunk_size found some 
[here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
 _size of the RowGroup in the parquet file. Normally you would choose this to 
be rather large_
 * Typo in file reader 
[example|https://arrow.apache.org/docs/cpp/parquet.html#filereader] the include 
should be {{#include "parquet/arrow/reader.h"}}
 * 
{{[WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE]
 missing docs on compression}}

 * Missing example on using WriteProperties


> [C++][Parquet][Docs] Reading/Writing Parquet Files
> --
>
> Key: ARROW-14161
> URL: https://issues.apache.org/jira/browse/ARROW-14161
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Rares Vernica
>Priority: Minor
>
> Missing documentation for the Reading/Writing Parquet files C++ API:
>  * 
> [WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
>  is missing docs on chunk_size; I found some 
> [here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]:
>  _size of the RowGroup in the parquet file. Normally you would choose this to 
> be rather large_
>  * Typo in the file reader 
> [example|https://arrow.apache.org/docs/cpp/parquet.html#filereader]: the 
> include should be {{#include "parquet/arrow/reader.h"}}
>  * 
> [WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE]
>  is missing docs on {{compression}}
>  * Missing example on using WriteProperties
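The missing WriterProperties example might look like the following. This is a sketch based on the linked parquet-cpp sample, not text from the official docs; the output path is arbitrary and the exact signatures should be checked against current headers.

```cpp
#include <memory>

#include <arrow/io/file.h>
#include <arrow/result.h>
#include <arrow/table.h>
#include <parquet/arrow/writer.h>

// Writes a table with an explicit RowGroup size and compression codec.
// chunk_size is the RowGroup size in rows; normally chosen rather large.
arrow::Status WriteExample(const std::shared_ptr<arrow::Table>& table) {
  ARROW_ASSIGN_OR_RAISE(auto outfile,
                        arrow::io::FileOutputStream::Open("example.parquet"));
  std::shared_ptr<parquet::WriterProperties> props =
      parquet::WriterProperties::Builder()
          .compression(parquet::Compression::SNAPPY)  // codec for all columns
          ->build();
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                    outfile, /*chunk_size=*/64 * 1024, props);
}
```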





[jira] [Updated] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files

2021-10-11 Thread Rares Vernica (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rares Vernica updated ARROW-14161:
--
Description: 
Missing documentation for the Reading/Writing Parquet files C++ API:
 * 
[WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
 missing docs on chunk_size found some 
[here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
 _size of the RowGroup in the parquet file. Normally you would choose this to 
be rather large_
 * Typo in file reader 
[example|https://arrow.apache.org/docs/cpp/parquet.html#filereader] the include 
should be {{#include "parquet/arrow/reader.h"}}
 * 
{{[WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE]
 missing docs on compression}}

 * Missing example on using WriteProperties

  was:
Missing documentation on Reading/Writing Parquet files C++ api:
 * 
[WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
 missing docs on chunk_size found some 
[here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
 _size of the RowGroup in the parquet file. Normally you would choose this to 
be rather large_
 * Typo in file reader 
[example|https://arrow.apache.org/docs/cpp/parquet.html#filereader]  the 
include should be {{#include "parquet/arrow/reader.h"}}


> [C++][Parquet][Docs] Reading/Writing Parquet Files
> --
>
> Key: ARROW-14161
> URL: https://issues.apache.org/jira/browse/ARROW-14161
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Rares Vernica
>Priority: Minor
>
> Missing documentation for the Reading/Writing Parquet files C++ API:
>  * 
> [WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
>  is missing docs on chunk_size; I found some 
> [here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]:
>  _size of the RowGroup in the parquet file. Normally you would choose this to 
> be rather large_
>  * Typo in the file reader 
> [example|https://arrow.apache.org/docs/cpp/parquet.html#filereader]: the 
> include should be {{#include "parquet/arrow/reader.h"}}
>  * 
> [WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE]
>  is missing docs on {{compression}}
>  * Missing example on using WriteProperties





[jira] [Updated] (ARROW-9843) [C++] Implement Between ternary kernel

2021-10-11 Thread Benson Muite (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Muite updated ARROW-9843:

Summary: [C++] Implement Between ternary kernel  (was: [C++] Implement 
Between trinary kernel)

> [C++] Implement Between ternary kernel
> --
>
> Key: ARROW-9843
> URL: https://issues.apache.org/jira/browse/ARROW-9843
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benson Muite
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A specialized {{between(arr, left_bound, right_bound)}} kernel would avoid 
> multiple scans and an AND operation





[jira] [Updated] (ARROW-14262) [C++] Document and rename is_in_meta_binary

2021-10-11 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-14262:
-
Labels: kernel  (was: )

> [C++] Document and rename is_in_meta_binary
> ---
>
> Key: ARROW-14262
> URL: https://issues.apache.org/jira/browse/ARROW-14262
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: kernel
>
> The is_in_meta_binary and index_in_meta_binary functions do not have any 
> "_doc" elements. I had simply ignored them, assuming they were some kind of 
> specialized function that shouldn't be exposed for general consumption (see 
> ARROW-13949), but I recently discovered they are legitimate binary variants 
> of their unary counterparts.
> If we want to continue to expose these functions, we should rename them 
> ("meta" I assume means meta function, but the Python/R user has no idea what 
> a meta function is) and add _doc elements.





[jira] [Created] (ARROW-14289) [C++] Change Scanner::Head to return a RecordBatchReader

2021-10-11 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-14289:
---

 Summary: [C++] Change Scanner::Head to return a RecordBatchReader
 Key: ARROW-14289
 URL: https://issues.apache.org/jira/browse/ARROW-14289
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, R
Reporter: Neal Richardson
 Fix For: 7.0.0


Following ARROW-9731 and ARROW-13893. This would make it more natural to work 
with ExecPlans that return a RecordBatchReader when you Run them. 
Alternatively, we could move the business to RecordBatchReader::Head.





[jira] [Created] (ARROW-14288) [R] Implement nrow on some collapsed queries

2021-10-11 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-14288:
---

 Summary: [R] Implement nrow on some collapsed queries
 Key: ARROW-14288
 URL: https://issues.apache.org/jira/browse/ARROW-14288
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 7.0.0


collapse() doesn't always mean we can't determine the number of rows. We can 
try to handle some cases:

* head/tail: compute the number of rows, then take the smaller of that and the 
head/tail number
* if filter == TRUE, take the number of rows of .data (which may itself 
contain a query)





[jira] [Commented] (ARROW-9688) Supporting Windows ARM64 builds

2021-10-11 Thread Niyas (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427264#comment-17427264
 ] 

Niyas commented on ARROW-9688:
--

I've created a [PR|https://github.com/apache/arrow/pull/11383] to enable 
building with clang-cl.

> Supporting Windows ARM64 builds
> ---
>
> Key: ARROW-9688
> URL: https://issues.apache.org/jira/browse/ARROW-9688
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
> Environment: Windows
>Reporter: Mukul Sabharwal
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 336h
>  Time Spent: 20m
>  Remaining Estimate: 335h 40m
>
> I was trying to build the Arrow library so I could use it to generate parquet 
> files on Windows ARM64, but it currently fails to compile for a few reasons. 
> I thought I'd enumerate them here, so someone more familiar with the project 
> could spearhead it.
> In SetupCxxFlags.cmake
>  * the MSVC branch for ARROW_CPU_FLAG STREQUAL "x86" is taken even though I'm 
> building for ARM64. This may be a more fundamental error elsewhere that needs 
> correction, and maybe things would work better once it is fixed, but an 
> inspection of other branches seemed to indicate that ARM64 is assumed to be 
> missing from MSVC, and the keyword "aarch64" (not a term used in the Windows 
> ecosystem) is prevalent in the cmake files. So the first thing I did was stub 
> it out and mark SSE4.2, AVX and AVX512 as not present
>  * In bit_util.h I provided implementations of popcount32 and popcount64 that 
> are not NEON-accelerated, although neon_cnt is provided by MSVC (for ARM64)
>  * Removed nmmintrin.h since that header is x86/x64 specific. Note that 
> _BitScanReverse and _BitScanForward are Microsoft specific and supported on 
> ARM64.
>  * cpu_info.cc needed tweaks for the cpuid code; I just returned false and 
> didn't worry too much about any upstream effects. flag_mappings and num_flags 
> ought to be defined inside the non-WIN32 ifdef, since they're not actually 
> used.
> After these changes I was able to remove the vcpkg restriction that 
> artificially failed the library from compiling on arm64 and I was able to 
> successfully compile for both arm64-windows-static and arm64-windows.
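For reference, the semantics such a portable (non-NEON) popcount fallback has to provide can be sketched in Python (illustrative only; the real fallback in bit_util.h is C++):

```python
def popcount32(x):
    # Portable population count: number of set bits in a 32-bit value.
    return bin(x & 0xFFFFFFFF).count("1")

def popcount64(x):
    # Same, for a 64-bit value; masking models the fixed-width integer.
    return bin(x & 0xFFFFFFFFFFFFFFFF).count("1")
```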





[jira] [Updated] (ARROW-9688) Supporting Windows ARM64 builds

2021-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9688:
--
Labels: pull-request-available  (was: )

> Supporting Windows ARM64 builds
> ---
>
> Key: ARROW-9688
> URL: https://issues.apache.org/jira/browse/ARROW-9688
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
> Environment: Windows
>Reporter: Mukul Sabharwal
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 336h
>  Time Spent: 10m
>  Remaining Estimate: 335h 50m
>





[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-11 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427218#comment-17427218
 ] 

Joris Van den Bossche commented on ARROW-14196:
---

bq. 2.  Make it possible to select columns by eliding the list name components.

Currently, I think the C++ API only deals with column indices? (at least for 
the Python bindings, the translation of column names to field indices happens 
in Python) For Python that should be relatively straightforward to implement. 
Opened ARROW-14286 for this. 

bq. If so, I'd have to support both naming conventions because both would exist 
in the wild.

[~jpivarski] yes, but that's already the case right now as well. Parquet files 
written by (py)arrow will use a different name for the list element compared to 
parquet files written by other tools (that's actually what we are trying to 
harmonize). So if you select a subfield of a list field by name, you already 
need to take into account potentially different names at the moment. 

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 6.0.0
>
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]





[jira] [Created] (ARROW-14287) [R] Selecting columns while reading Parquet file with nested types can give wrong column

2021-10-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-14287:
-

 Summary: [R] Selecting columns while reading Parquet file with 
nested types can give wrong column
 Key: ARROW-14287
 URL: https://issues.apache.org/jira/browse/ARROW-14287
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Joris Van den Bossche


I created two small files (using Python for my convenience):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2], "b": [3, 4]})
pq.write_table(table, "test1.parquet")

table = pa.table({"a": [1, 2], "nested": [[{'f1': 1, 'f2': 3}, {'f1': 3, 'f2': 4}], None], "b": [3, 4]})
pq.write_table(table, "test2.parquet")
{code}

where the first is a simple file, and the second contains a column with a 
nested list of struct type.

Reading that in R with a column selection works in the first case, but in the 
second case it actually reads the second column instead of the third:

{code:r}
> arrow::read_parquet("test1.parquet", col_select=c("b"))
  b
1 3
2 4
> arrow::read_parquet("test2.parquet", col_select=c("b"))
  nested
1   3, 4
2   NULL
{code}

This is due to the simple conversion of column names to integer indices in the 
R code, while Parquet counts the individual fields of nested columns separately.
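The mismatch can be illustrated in plain Python (hypothetical helper; leaf counts model the test2.parquet layout above, where pyarrow is not required):

```python
def leaf_indices(schema, name):
    """Map a top-level column name to its Parquet leaf-field indices.

    schema: list of (name, n_leaves) pairs; a nested column counts each
    primitive sub-field as a separate Parquet leaf.
    """
    start = 0
    for col, n_leaves in schema:
        if col == name:
            return list(range(start, start + n_leaves))
        start += n_leaves
    raise KeyError(name)

# test2.parquet: "a" (1 leaf), "nested" (f1 + f2 = 2 leaves), "b" (1 leaf)
schema = [("a", 1), ("nested", 2), ("b", 1)]

naive = [name for name, _ in schema].index("b")   # simple positional index
correct = leaf_indices(schema, "b")               # Parquet leaf numbering
```

The naive positional index for "b" is 2, which in Parquet's flattened leaf numbering points into the "nested" column; the correct leaf index is 3.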





[jira] [Resolved] (ARROW-13944) [C++] Bump xsimd to latest version

2021-10-11 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-13944.

Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11142
[https://github.com/apache/arrow/pull/11142]

> [C++] Bump xsimd to latest version
> --
>
> Key: ARROW-13944
> URL: https://issues.apache.org/jira/browse/ARROW-13944
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> xsimd is refactored to use architecture instead of register size to define a 
> batch.
> I've adapted arrow code to this change.
There's one xsimd bug [1] that needs to be fixed before we can upgrade.
> [1] https://github.com/xtensor-stack/xsimd/pull/553





[jira] [Created] (ARROW-14286) [Python][Parquet] Allow to select columns of a list field without requiring the list component names

2021-10-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-14286:
-

 Summary: [Python][Parquet] Allow to select columns of a list field 
without requiring the list component names
 Key: ARROW-14286
 URL: https://issues.apache.org/jira/browse/ARROW-14286
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Subtask for ARROW-14196.

Currently, if you have a list column where the list elements themselves are 
nested (e.g. a list of structs), selecting a subset of that list column requires 
something like {{columns=["columnA.list.item.subfield"]}}. This "list.item" is 
superfluous, since a list always contains a single child, so ideally we would 
allow specifying this as {{columns=["columnA.subfield"]}}. 

This also avoids relying on the exact name of the list item (item vs element), 
for which the default differs between Parquet and Arrow.
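The proposed normalization can be sketched in Python (hypothetical helper name, not an existing pyarrow function):

```python
def normalize_column_path(path):
    """Drop superfluous list component segments from a dotted column path.

    'list.item' (Parquet default) and 'list.element' (compliant name) both
    denote the single child of a list, so they carry no information.
    """
    parts = path.split(".")
    out = []
    i = 0
    while i < len(parts):
        if parts[i] == "list" and i + 1 < len(parts) and parts[i + 1] in ("item", "element"):
            i += 2               # skip the 'list.<child>' pair
        else:
            out.append(parts[i])
            i += 1
    return ".".join(out)
```

With such a helper, "columnA.list.item.subfield" and "columnA.list.element.subfield" both normalize to "columnA.subfield", so the selection no longer depends on which list-item name the writer used.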





[jira] [Updated] (ARROW-14285) [C++] Fix crashes when pretty-printing data from valid IPC file (OSS-Fuzz)

2021-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14285:
---
Labels: pull-request-available  (was: )

> [C++] Fix crashes when pretty-printing data from valid IPC file (OSS-Fuzz)
> --
>
> Key: ARROW-14285
> URL: https://issues.apache.org/jira/browse/ARROW-14285
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>
> Fix the following issues found by OSS-Fuzz:
> * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39677
> * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39703
> * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39763
> * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39773





[jira] [Created] (ARROW-14285) [C++] Fix crashes when pretty-printing data from valid IPC file (OSS-Fuzz)

2021-10-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-14285:
--

 Summary: [C++] Fix crashes when pretty-printing data from valid 
IPC file (OSS-Fuzz)
 Key: ARROW-14285
 URL: https://issues.apache.org/jira/browse/ARROW-14285
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 6.0.0


Fix the following issues found by OSS-Fuzz:
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39677
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39703
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39763
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39773






[jira] [Updated] (ARROW-10140) [Python][C++] Add test for map column of a parquet file created from pyarrow and pandas

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-10140:
--
Fix Version/s: (was: 6.0.0)
   7.0.0

> [Python][C++] Add test for map column of a parquet file created from pyarrow 
> and pandas
> ---
>
> Key: ARROW-10140
> URL: https://issues.apache.org/jira/browse/ARROW-10140
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Chen Ming
>Assignee: Joris Van den Bossche
>Priority: Minor
> Fix For: 7.0.0
>
> Attachments: pyspark.snappy.parquet, test_map.parquet, test_map.py, 
> test_map_2.0.0.parquet, test_map_200.parquet
>
>
> Hi,
> I'm having problems reading parquet files with 'map' data type created by 
> pyarrow.
> I followed 
> [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries]
>  to convert a pandas DF to an arrow table, then call write_table to output a 
> parquet file:
> (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df = pd.DataFrame({
>  'col1': pd.Series([
>  [('id', 'something'), ('value2', 'else')],
>  [('id', 'something2'), ('value','else2')],
>  ]),
>  'col2': pd.Series(['foo', 'bar'])
>  })
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs smoothly on my developing 
> computer:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> And generated the test_map.parquet file (attached as test_map.parquet) 
> successfully.
> Then I use parquet-tools (1.11.1) to read the file, but get the following 
> output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
> repeated group key_value {
>   required binary key (STRING);
>   optional binary value (STRING);
> }
>   }
>   optional binary col2 (STRING);
> }{code}
> Am I doing something wrong? 
> We need to output the data to parquet files, and query them later.





[jira] [Created] (ARROW-14284) [C++][Python] Improve error message when trying to use SyncScanner when requiring async

2021-10-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-14284:
-

 Summary: [C++][Python] Improve error message when trying to use 
SyncScanner when requiring async
 Key: ARROW-14284
 URL: https://issues.apache.org/jira/browse/ARROW-14284
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


See ARROW-14257

The current error message gives "Asynchronous scanning is not supported by 
SyncScanner"

Copying the comment of [~westonpace]:

In Python it is always use_async=True. In R the scanner is hidden from the user 
on dataset writes but the option there is use_async as well. In C++ the option 
is UseAsync in the ScannerBuilder. How about,

"Writing datasets requires that the input scanner is configured to scan 
asynchronously via the use_async or UseAsync options."






[jira] [Resolved] (ARROW-14257) [Doc][Python] dataset doc build fails

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-14257.
---
Resolution: Fixed

Issue resolved by pull request 11364
[https://github.com/apache/arrow/pull/11364]

> [Doc][Python] dataset doc build fails
> -
>
> Key: ARROW-14257
> URL: https://issues.apache.org/jira/browse/ARROW-14257
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Reporter: Antoine Pitrou
>Assignee: Joris Van den Bossche
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> {code}
> >>>-
> Exception in /home/antoine/arrow/dev/docs/source/python/dataset.rst at block 
> ending on line 578
> Specify :okexcept: as an option in the ipython:: block to suppress this 
> message
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
> > 1 ds.write_dataset(scanner, new_root, format="parquet", 
> partitioning=new_part)
> ~/arrow/dev/python/pyarrow/dataset.py in write_dataset(data, base_dir, 
> basename_template, format, partitioning, partitioning_flavor, schema, 
> filesystem, file_options, use_threads, max_partitions, file_visitor)
> 861 _filesystemdataset_write(
> 862 scanner, base_dir, basename_template, filesystem, 
> partitioning,
> --> 863 file_options, max_partitions, file_visitor
> 864 )
> ~/arrow/dev/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> ~/arrow/dev/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: Asynchronous scanning is not supported by 
> SyncScanner
> /home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:367  
> scanner->ScanBatchesAsync()
> <<<-
> {code}





[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-2659:
-
Fix Version/s: (was: 6.0.0)
   7.0.0

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.9.0
>Reporter: Uwe Korn
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, dataset-parquet-read, parquet
> Fix For: 7.0.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt
>
>
> When currently saving a {{ParquetDataset}} from Pandas, we don't get 
> consistent schemas, even if the source was a single DataFrame. This is due to 
> the fact that in some partitions object columns like string can become empty. 
> Then the resulting Arrow schema will differ. In the central metadata, we will 
> store this column as {{pa.string}} whereas in the partition file with the 
> empty columns, this column will be stored as {{pa.null}}.
> The two schemas are still a valid match in terms of schema evolution and we 
> should respect that in 
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754
>  Instead of doing a {{pa.Schema.equals}} in 
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778
>  we should introduce a new method {{pa.Schema.can_evolve_to}} that is more 
> graceful and returns {{True}} if a dataset piece has a null column where the 
> main metadata states a nullable column of any type.
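The proposed {{can_evolve_to}} check can be sketched in plain Python (hypothetical; schemas are modeled as (name, type, nullable) tuples rather than real pyarrow objects):

```python
def can_evolve_to(piece_schema, main_schema):
    """Graceful schema match: identical fields, or a null column in the
    piece where the main metadata declares a nullable field of any type."""
    if len(piece_schema) != len(main_schema):
        return False
    for (p_name, p_type, _), (m_name, m_type, m_null) in zip(piece_schema, main_schema):
        if p_name != m_name:
            return False
        if p_type == m_type:
            continue
        if p_type == "null" and m_null:
            continue    # empty partition: null column can evolve to the declared type
        return False
    return True
```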





[jira] [Updated] (ARROW-5248) [Python] support dateutil timezones

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5248:
-
Fix Version/s: (was: 6.0.0)
   7.0.0

> [Python] support dateutil timezones
> ---
>
> Key: ARROW-5248
> URL: https://issues.apache.org/jira/browse/ARROW-5248
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>  Labels: beginner
> Fix For: 7.0.0
>
>
> The {{dateutil}} package also provides a set of timezone objects 
> (https://dateutil.readthedocs.io/en/stable/tz.html) in addition to {{pytz}}. 
> In pyarrow, we only support pytz timezones (and the stdlib datetime.timezone 
> fixed offset):
> {code}
> In [2]: import dateutil.tz
> 
> In [3]: import pyarrow as pa
> 
> In [5]: pa.timestamp('us', dateutil.tz.gettz('Europe/Brussels'))
> ...
> ~/miniconda3/envs/dev37/lib/python3.7/site-packages/pyarrow/types.pxi in 
> pyarrow.lib.tzinfo_to_string()
> ValueError: Unable to convert timezone 
> `tzfile('/usr/share/zoneinfo/Europe/Brussels')` to string
> {code}
> But pandas also supports dateutil timezones. As a consequence, when a pandas 
> DataFrame uses a dateutil timezone, you get an error when converting it to an 
> arrow table.
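For the fixed-offset case that is supported, the serialization can be sketched with the stdlib alone (an illustrative sketch, not pyarrow's actual tzinfo_to_string code):

```python
from datetime import timedelta, timezone

def offset_to_string(tz):
    """Serialize a stdlib fixed-offset timezone as '+HH:MM' / '-HH:MM'.

    Covers only the datetime.timezone case mentioned above; pytz zones are
    serialized by their IANA name instead, and dateutil zones are the
    unsupported case this issue is about.
    """
    seconds = int(tz.utcoffset(None).total_seconds())
    sign = "+" if seconds >= 0 else "-"
    seconds = abs(seconds)
    return "%s%02d:%02d" % (sign, seconds // 3600, (seconds % 3600) // 60)
```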





[jira] [Updated] (ARROW-10726) [Python] Reading multiple parquet files with different index column dtype (originating pandas) reads wrong data

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-10726:
--
Fix Version/s: (was: 6.0.0)
   7.0.0

> [Python] Reading multiple parquet files with different index column dtype 
> (originating pandas) reads wrong data
> ---
>
> Key: ARROW-10726
> URL: https://issues.apache.org/jira/browse/ARROW-10726
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 7.0.0
>
>
> See https://github.com/pandas-dev/pandas/issues/38058





[jira] [Assigned] (ARROW-14004) [Python] to_pandas() converts to float instead of using pandas nullable types

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-14004:
-

Assignee: Joris Van den Bossche

> [Python] to_pandas() converts to float instead of using pandas nullable types
> -
>
> Key: ARROW-14004
> URL: https://issues.apache.org/jira/browse/ARROW-14004
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Reporter: Miguel Cantón Cortés
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pandas
> Fix For: 6.0.0
>
> Attachments: image.png
>
>
> We've noticed that when converting an Arrow Table to pandas using 
> `.to_pandas()`, integer columns with null values get converted to float 
> instead of using pandas nullable types.
> If the column was created with pandas first it is correctly preserved (I 
> guess it's using stored metadata for this).
> I've attached a screenshot showing this behavior.
> As currently there is support for nullable types in pandas, just as in Arrow, 
> it would be great to use these types when dealing with columns with null 
> values.
> If you are reluctant to change this behavior, a parameter would be nice too 
> (e.g. `to_pandas(use_nullable_types=True)`).
>  
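The current versus proposed behavior can be sketched in plain Python (hypothetical model; the real conversion lives in pyarrow's pandas bridge):

```python
def to_pandas_ints(values, use_nullable=False):
    """Sketch of converting an Arrow integer column to a pandas dtype.

    values: list of ints with None marking nulls (a toy stand-in for an
    Arrow int64 column).  Returns (dtype_name, converted_values).
    """
    if None not in values:
        return ("int64", list(values))
    if use_nullable:
        # pandas nullable integer dtype keeps both the ints and the nulls
        return ("Int64", list(values))
    # current default: fall back to float64, nulls become NaN
    return ("float64", [float("nan") if v is None else float(v) for v in values])
```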





[jira] [Updated] (ARROW-13525) [Python] Mention alternatives in deprecation message of ParquetDataset attributes

2021-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13525:
---
Labels: pull-request-available  (was: )

> [Python] Mention alternatives in deprecation message of ParquetDataset 
> attributes
> -
>
> Key: ARROW-13525
> URL: https://issues.apache.org/jira/browse/ARROW-13525
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.1, 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Follow-up on ARROW-13074. 
> We should maybe also expose the {{partitioning}} attribute on ParquetDataset 
> (if constructed with {{use_legacy_dataset=False}}), as I did for the 
> {{filesystem}}/{{files}}/{{fragments}} attributes. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14283) [C++][CI] LLVM cannot be found on macOS GHA builds

2021-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14283:
---
Labels: pull-request-available  (was: )

> [C++][CI] LLVM cannot be found on macOS GHA builds
> --
>
> Key: ARROW-14283
> URL: https://issues.apache.org/jira/browse/ARROW-14283
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See 
> https://github.com/apache/arrow/pull/11372/checks?check_run_id=3859972940
> https://github.com/apache/arrow/pull/11372/checks?check_run_id=3859973472
> https://github.com/apache/arrow/pull/11372/checks?check_run_id=3859973399



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14282) [R] altrep vectors for factors (dictionaries)

2021-10-11 Thread Romain Francois (Jira)
Romain Francois created ARROW-14282:
---

 Summary: [R] altrep vectors for factors (dictionaries)
 Key: ARROW-14282
 URL: https://issues.apache.org/jira/browse/ARROW-14282
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Romain Francois
Assignee: Romain Francois


As is the case in Converter_Dictionary, this should probably be split into two 
paths depending on whether the arrays need unification (i.e. whether all the 
levels are the same).





[jira] [Created] (ARROW-14281) How to Review PRs Guidelines

2021-10-11 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-14281:
-

 Summary: How to Review PRs Guidelines
 Key: ARROW-14281
 URL: https://issues.apache.org/jira/browse/ARROW-14281
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Alessandro Molina
Assignee: Antoine Pitrou
 Fix For: 7.0.0








[jira] [Updated] (ARROW-14280) R-Arrow Architectural Overview

2021-10-11 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14280:
--
Parent: ARROW-14278
Issue Type: Sub-task  (was: Improvement)

> R-Arrow Architectural Overview
> --
>
> Key: ARROW-14280
> URL: https://issues.apache.org/jira/browse/ARROW-14280
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Priority: Major
>






[jira] [Created] (ARROW-14279) PyArrow Architectural Overview

2021-10-11 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-14279:
-

 Summary: PyArrow Architectural Overview
 Key: ARROW-14279
 URL: https://issues.apache.org/jira/browse/ARROW-14279
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Alessandro Molina
Assignee: Alessandro Molina
 Fix For: 7.0.0








[jira] [Created] (ARROW-14280) R-Arrow Architectural Overview

2021-10-11 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-14280:
-

 Summary: R-Arrow Architectural Overview
 Key: ARROW-14280
 URL: https://issues.apache.org/jira/browse/ARROW-14280
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Alessandro Molina








[jira] [Created] (ARROW-14278) New Contributors Guide

2021-10-11 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-14278:
-

 Summary: New Contributors Guide
 Key: ARROW-14278
 URL: https://issues.apache.org/jira/browse/ARROW-14278
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Alessandro Molina
 Fix For: 7.0.0


Umbrella Issue for the Guide for new contributors for Python and R





[jira] [Created] (ARROW-14277) R Tutorials 2021-Q4 Initiative

2021-10-11 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-14277:
-

 Summary: R Tutorials 2021-Q4 Initiative
 Key: ARROW-14277
 URL: https://issues.apache.org/jira/browse/ARROW-14277
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Alessandro Molina
 Fix For: 7.0.0


An umbrella ticket for the initiative of writing up a set of Tutorials for R 
users





[jira] [Resolved] (ARROW-14259) [R] converting from R vector to Array when the R vector is altrep

2021-10-11 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-14259.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11366
[https://github.com/apache/arrow/pull/11366]

> [R] converting from R vector to Array when the R vector is altrep
> -
>
> Key: ARROW-14259
> URL: https://issues.apache.org/jira/browse/ARROW-14259
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When we have an R vector that was created from an Array with altrep, and then 
> we want to convert again to an Array, currently it materializes it, and it 
> should not. Instead it should be grabbing the array from the internals -of 
> the altrep object. 





[jira] [Comment Edited] (ARROW-14260) [C++] GTest linker error with vcpkg and Visual Studio 2019

2021-10-11 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425872#comment-17425872
 ] 

Ian Cook edited comment on ARROW-14260 at 10/11/21, 1:51 PM:
-

[~apitrou] this seems like it might be related to ARROW-14247. Let's see if the 
fix for that in [#11356|https://github.com/apache/arrow/pull/11356/] makes this 
go away. (update: it did not)


was (Author: icook):
[~apitrou] this seems like it might be related to ARROW-14247. Let's see if the 
fix for that in [#11356|https://github.com/apache/arrow/pull/11356/] makes this 
go away.

> [C++] GTest linker error with vcpkg and Visual Studio 2019
> --
>
> Key: ARROW-14260
> URL: https://issues.apache.org/jira/browse/ARROW-14260
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ian Cook
>Priority: Major
>
> The *test-build-vcpkg-win* nightly Crossbow job is failing with these linker 
> errors:
> {code:java}
>  unity_1_cxx.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) void __cdecl 
> testing::internal2::PrintBytesInObjectTo(unsigned char const *,unsigned 
> __int64,class std::basic_ostream<char,struct std::char_traits<char> > *)" 
> (__imp_?PrintBytesInObjectTo@internal2@testing@@YAXPEBE_KPEAV?$basic_ostream@DU?$char_traits@D@std@@@std@@@Z)
>  referenced in function "class std::basic_ostream<char,struct 
> std::char_traits<char> > & __cdecl testing::internal2::operator<<<char,struct 
> std::char_traits<char>,class std::_Vector_iterator<class 
> std::_Vector_val<struct std::_Simple_types<class arrow::compute::ExecNode 
> *> > > >(class std::basic_ostream<char,struct std::char_traits<char> > 
> &,class std::_Vector_iterator<class std::_Vector_val<struct 
> std::_Simple_types<class arrow::compute::ExecNode *> > > const &)" 
> (??$?6DU?$char_traits@D@std@@V?$_Vector_iterator@V?$_Vector_val@U?$_Simple_types@PEAVExecNode@compute@arrow@@@std@@@std@@@1@@internal2@testing@@YAAEAV?$basic_ostream@DU?$char_traits@D@std@@@std@@AEAV23@AEBV?$_Vector_iterator@V?$_Vector_val@U?$_Simple_types@PEAVExecNode@compute@arrow@@@std@@@std@@@3@@Z)
>  
> unity_1_cxx.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) class testing::AssertionResult __cdecl 
> testing::internal::CmpHelperEQ(char const *,char const *,__int64,__int64)" 
> (__imp_?CmpHelperEQ@internal@testing@@YA?AVAssertionResult@2@PEBD0_J1@Z) 
> referenced in function "void __cdecl arrow::fs::AssertFileContents(class 
> arrow::fs::FileSystem *,class std::basic_string<char,struct 
> std::char_traits<char>,class std::allocator<char> > const &,class 
> std::basic_string<char,struct std::char_traits<char>,class 
> std::allocator<char> > const &)" 
> (?AssertFileContents@fs@arrow@@YAXPEAVFileSystem@12@AEBV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@1@Z)
>  
> unity_0_cxx.obj : error LNK2001: unresolved external symbol 
> "__declspec(dllimport) class testing::AssertionResult __cdecl 
> testing::internal::CmpHelperEQ(char const *,char const *,__int64,__int64)" 
> (__imp_?CmpHelperEQ@internal@testing@@YA?AVAssertionResult@2@PEBD0_J1@Z) 
> {code}
> Link to the error where it occurs in the full log: 
> https://github.com/ursacomputing/crossbow/runs/3799925986#step:4:2737



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14213) R arrow package not working on RStudio/Ubuntu

2021-10-11 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427137#comment-17427137
 ] 

Neal Richardson commented on ARROW-14213:
-

Re-reading the logs, it looks like the installation was successful, but you 
were installing arrow from a session where you already had arrow loaded. Have 
you restarted R?

> R arrow package not working on RStudio/Ubuntu
> -
>
> Key: ARROW-14213
> URL: https://issues.apache.org/jira/browse/ARROW-14213
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
> Copyright (C) 2020 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
>Reporter: Thomas Wutzler
>Priority: Major
>
> I am trying to read feather files in R with the arrow package that were 
> generated in Python.
> I am running R 3.6.3 in an RStudio Server window on a Linux machine to which 
> I have no other access. I get the message:
>  {{Cannot call io___MemoryMappedFile__Open().}}
> Following the advice in the linked help file 
> [https://cran.r-project.org/web/packages/arrow/vignettes/install.html], I am 
> creating this issue with the full log of the installation:
> > arrow::install_arrow(verbose = TRUE)Installing package into 
> > '/Net/Groups/BGI/scratch/twutz/R/atacama-library/3.6'
> (as 'lib' is unspecified)trying URL 
> 'https://ftp5.gwdg.de/pub/misc/cran/src/contrib/arrow_5.0.0.2.tar.gz'Content 
> type 'application/octet-stream' length 483642 bytes (472 
> KB)==downloaded 472 KB* 
> installing *source* package 'arrow' ...** package 'arrow' successfully 
> unpacked and MD5 sums checked** using staged installationtrying URL 
> 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/ubuntu-16.04/arrow-5.0.0.2.zip'Content
>  type 'binary/octet-stream' length 17214781 bytes (16.4 
> MB)==downloaded 16.4 MB*** 
> Successfully retrieved C++ binaries for ubuntu-16.04
>  Binary package requires libcurl and openssl
>  If installation fails, retry after installing those system requirements
> PKG_CFLAGS=-I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include
>   -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3
> PKG_LIBS=-L/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/lib
>  -larrow_dataset -lparquet -larrow -larrow -larrow_bundled_dependencies 
> -larrow_dataset -lparquet -lssl -lcrypto -lcurl** libsg++ -std=gnu++11 
> -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c RTasks.cpp -o RTasks.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c altrep.cpp -o altrep.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c array.cpp -o array.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c array_to_vector.cpp -o array_to_vector.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c arraydata.cpp -o arraydata.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> 

[jira] [Closed] (ARROW-14276) [Packaging] Dependency resolution issues in the nightly conda builds

2021-10-11 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-14276.
--
Fix Version/s: (was: 6.0.0)
   Resolution: Duplicate

> [Packaging] Dependency resolution issues in the nightly conda builds
> 
>
> Key: ARROW-14276
> URL: https://issues.apache.org/jira/browse/ARROW-14276
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
>
> The majority of the conda nightly builds are failing due to dependency 
> resolution problems:
> {code}
> - conda-linux-gcc-py37-arm64:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py37-arm64
> - conda-linux-gcc-py37-cpu-r41:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py37-cpu-r41
> - conda-linux-gcc-py37-cuda:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py37-cuda
> - conda-linux-gcc-py38-arm64:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py38-arm64
> - conda-linux-gcc-py38-cpu:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py38-cpu
> - conda-linux-gcc-py38-cuda:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py38-cuda
> - conda-linux-gcc-py39-arm64:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py39-arm64
> - conda-linux-gcc-py39-cpu:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py39-cpu
> - conda-linux-gcc-py39-cuda:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py39-cuda
> - conda-win-vs2017-py36-r40:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-win-vs2017-py36-r40
> - conda-win-vs2017-py38:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-win-vs2017-py38
> - conda-win-vs2017-py39:
>   URL: 
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-win-vs2017-py39
> {code}
> I assume that we need to sync the recipes again with up-to-date pin files. 
> cc @uwe





[jira] [Updated] (ARROW-14275) [C++][Parquet][Doc] default output option for Parquet Scan Example

2021-10-11 Thread Benson Muite (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Muite updated ARROW-14275:
-
Summary: [C++][Parquet][Doc] default output option for Parquet Scan Example 
 (was: should have default output option)

> [C++][Parquet][Doc] default output option for Parquet Scan Example
> --
>
> Key: ARROW-14275
> URL: https://issues.apache.org/jira/browse/ARROW-14275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
>
> [Parquet scan 
> example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/dataset_parquet_scan_example.cc]
>   should not fake success if no argument is given, but should instead create 
> a new directory in the current directory.





[jira] [Created] (ARROW-14276) [Packaging] Dependency resolution issues in the nightly conda builds

2021-10-11 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-14276:
---

 Summary: [Packaging] Dependency resolution issues in the nightly 
conda builds
 Key: ARROW-14276
 URL: https://issues.apache.org/jira/browse/ARROW-14276
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Packaging
Reporter: Krisztian Szucs
 Fix For: 6.0.0


The majority of the conda nightly builds are failing due to dependency 
resolution problems:

{code}
- conda-linux-gcc-py37-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py37-arm64
- conda-linux-gcc-py37-cpu-r41:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py37-cpu-r41
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py38-arm64
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py38-cuda
- conda-linux-gcc-py39-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py39-arm64
- conda-linux-gcc-py39-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py39-cpu
- conda-linux-gcc-py39-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-linux-gcc-py39-cuda
- conda-win-vs2017-py36-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-win-vs2017-py36-r40
- conda-win-vs2017-py38:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-win-vs2017-py38
- conda-win-vs2017-py39:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-10-11-0-azure-conda-win-vs2017-py39
{code}

I assume that we need to sync the recipes again with up-to-date pin files. 

cc @uwe





[jira] [Updated] (ARROW-13710) [Doc][Cookbook] Sending and receiving data over a network using an Arrow Flight RPC server - Python

2021-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13710:
---
Labels: pull-request-available  (was: )

> [Doc][Cookbook] Sending and receiving data over a network using an Arrow 
> Flight RPC server - Python
> ---
>
> Key: ARROW-13710
> URL: https://issues.apache.org/jira/browse/ARROW-13710
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (ARROW-14275) should have default output option

2021-10-11 Thread Benson Muite (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Muite updated ARROW-14275:
-
Component/s: Documentation

> should have default output option
> -
>
> Key: ARROW-14275
> URL: https://issues.apache.org/jira/browse/ARROW-14275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
>
> [Parquet scan 
> example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/dataset_parquet_scan_example.cc]
>   should not fake success if no argument is given, but should instead create 
> a new directory in the current directory.





[jira] [Created] (ARROW-14275) should have default output option

2021-10-11 Thread Benson Muite (Jira)
Benson Muite created ARROW-14275:


 Summary: should have default output option
 Key: ARROW-14275
 URL: https://issues.apache.org/jira/browse/ARROW-14275
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Parquet
Reporter: Benson Muite
Assignee: Benson Muite


[Parquet scan 
example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/dataset_parquet_scan_example.cc]
  should not fake success if no argument is given, but should instead create a 
new directory in the current directory.





[jira] [Commented] (ARROW-14274) [C++] Upgrade vendored base64 code

2021-10-11 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427069#comment-17427069
 ] 

Antoine Pitrou commented on ARROW-14274:


I am not aware that base64 is performance-critical currently. That said, I'm 
OK with improving the code, or even using a totally different implementation 
if desired.

> [C++] Upgrade vendored base64 code
> --
>
> Key: ARROW-14274
> URL: https://issues.apache.org/jira/browse/ARROW-14274
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>
> The vendored base64 code looks suboptimal. [1]
> We should at least upgrade to the latest upstream code, which has improved a 
> lot. [2]
> Maybe adopt a more optimized implementation if base64 performance matters 
> for Arrow. [3]
> [1] 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/base64.cpp#L49
> [2] https://github.com/ReneNyffenegger/cpp-base64/blob/master/base64.cpp#L129
> [3] https://github.com/aklomp/base64





[jira] [Created] (ARROW-14274) [C++] Upgrade vendored base64 code

2021-10-11 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-14274:


 Summary: [C++] Upgrade vendored base64 code
 Key: ARROW-14274
 URL: https://issues.apache.org/jira/browse/ARROW-14274
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


The vendored base64 code looks suboptimal. [1]
We should at least upgrade to the latest upstream code, which has improved a 
lot. [2]
Maybe adopt a more optimized implementation if base64 performance matters for 
Arrow. [3]

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/base64.cpp#L49
[2] https://github.com/ReneNyffenegger/cpp-base64/blob/master/base64.cpp#L129
[3] https://github.com/aklomp/base64





[jira] [Updated] (ARROW-13730) [Doc][Cookbook] Adding a column to an existing Table - Python

2021-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13730:
---
Labels: pull-request-available  (was: )

> [Doc][Cookbook] Adding a column to an existing Table - Python
> -
>
> Key: ARROW-13730
> URL: https://issues.apache.org/jira/browse/ARROW-13730
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (ARROW-13317) [Python] Improve documentation on what 'use_threads' does in 'read_feather'

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13317:
--
Labels: documentation good-first-issue  (was: documentation)

> [Python] Improve documentation on what 'use_threads' does in 'read_feather'
> ---
>
> Key: ARROW-13317
> URL: https://issues.apache.org/jira/browse/ARROW-13317
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 4.0.1
>Reporter: Arun Joseph
>Priority: Trivial
>  Labels: documentation, good-first-issue
> Fix For: 7.0.0
>
>
> The current documentation for 
> [read_feather|https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_feather.html]
>  states the following:
> *use_threads* (_bool_, default True) – Whether to parallelize reading 
> using multiple threads.
> If the underlying file uses compression, then multiple threads can still be 
> spawned. The wording of *use_threads* is ambiguous about whether the 
> restriction on multiple threads applies only to the conversion from pyarrow 
> to the pandas DataFrame, or also to the reading/decompression of the file 
> itself, which might spawn additional threads.
> [set_cpu_count|http://arrow.apache.org/docs/python/generated/pyarrow.set_cpu_count.html#pyarrow.set_cpu_count]
>  might be worth mentioning as a way to actually limit the threads spawned





[jira] [Updated] (ARROW-13436) [Python][Doc] Clarify what should be expected if read_table is passed an empty list of columns

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13436:
--
Labels: good-first-issue  (was: )

> [Python][Doc] Clarify what should be expected if read_table is passed an 
> empty list of columns
> --
>
> Key: ARROW-13436
> URL: https://issues.apache.org/jira/browse/ARROW-13436
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: good-first-issue
>
> The documentation for pyarrow.parquet.read_table states:
>  
>  * *columns* (_list_) – If not None, only these columns will be read from the 
> file. A column name may be a prefix of a nested field, e.g. ‘a’ will select 
> ‘a.b’, ‘a.c’, and ‘a.d.e’.
>  
> It is not clear what the expected result should be if columns is an empty 
> list.  In pyarrow 3.0 this read in all columns (as long as 
> use_legacy_dataset=False).  In pyarrow 4.0 this doesn't read in any columns.  
> I think this behavior (not reading in any columns) is correct (since None 
> can be used to select all columns), but we should clarify that in the docs.





[jira] [Updated] (ARROW-13525) [Python] Mention alternatives in deprecation message of ParquetDataset attributes

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13525:
--
Fix Version/s: 6.0.0

> [Python] Mention alternatives in deprecation message of ParquetDataset 
> attributes
> -
>
> Key: ARROW-13525
> URL: https://issues.apache.org/jira/browse/ARROW-13525
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 5.0.1, 6.0.0
>
>
> Follow-up on ARROW-13074. 
> We should maybe also expose the {{partitioning}} attribute on ParquetDataset 
> (if constructed with {{use_legacy_dataset=False}}), as I did for the 
> {{filesystem}}/{{files}}/{{fragments}} attributes. 





[jira] [Updated] (ARROW-13922) ParquetDataset throws error when len(path_or_paths) = 1

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13922:
--
Labels: good-second-issue  (was: )

> ParquetDataset throws error when len(path_or_paths) = 1
> ---
>
> Key: ARROW-13922
> URL: https://issues.apache.org/jira/browse/ARROW-13922
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Ashish Gupta
>Assignee: Weston Pace
>Priority: Major
>  Labels: good-second-issue
>
>  
> After updating pyarrow to version 5.0.0, ParquetDataset doesn't take a list 
> of length 1 for path_or_paths. Is this by design or a bug?
>  
> {code:java}
> In [1]: import pyarrow.parquet as pq
> In [2]: import pandas as pd
> In [3]: df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
> In [4]: df.to_parquet('test.parquet', index=False)
> In [5]: pq.ParquetDataset('test.parquet', 
> use_legacy_dataset=False).read(use_threads=False).to_pandas()
> Out[5]:
>A  B
> 0  1  a
> 1  2  b
> 2  3  c
> In [6]: pq.ParquetDataset(['test.parquet'], 
> use_legacy_dataset=False).read(use_threads=False).to_pandas()
> ---
> ValueErrorTraceback (most recent call last)
> ValueError: cannot construct a FileSource from a path without a FileSystem
> Exception ignored in: 'pyarrow._dataset._make_file_source'
> Traceback (most recent call last):
>   File 
> "/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1676, in __init__
> fragment = parquet_format.make_fragment(single_file, filesystem)
> ValueError: cannot construct a FileSource from a path without a FileSystem
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pq.ParquetDataset(['test.parquet'], 
> use_legacy_dataset=False).read(use_threads=False).to_pandas()/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py
>  in __new__(cls, path_or_paths, filesystem, schema, metadata, 
> split_row_groups, validate_schema, filters, metadata_nthreads, 
> read_dictionary, memory_map, buffer_size, partitioning, use_legacy_dataset, 
> pre_buffer, coerce_int96_timestamp_unit)
>1284
>1285 if not use_legacy_dataset:
> -> 1286 return _ParquetDatasetV2(
>1287 path_or_paths, filesystem=filesystem,
>1288 
> filters=filters,/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py
>  in __init__(self, path_or_paths, filesystem, filters, partitioning, 
> read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, 
> coerce_int96_timestamp_unit, **kwargs)
>1677
>1678 self._dataset = ds.FileSystemDataset(
> -> 1679 [fragment], schema=fragment.physical_schema,
>1680 format=parquet_format,
>1681 
> filesystem=fragment.filesystem/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/_dataset.pyx
>  in 
> pyarrow._dataset.Fragment.physical_schema.__get__()/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/error.pxi
>  in 
> pyarrow.lib.pyarrow_internal_check_status()/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()ArrowInvalid: Called Open() on an uninitialized 
> FileSource
> In [7]: pq.ParquetDataset(['test.parquet', 'test.parquet'], 
> use_legacy_dataset=False).read(use_threads=False).to_pandas()
> Out[7]:
>A  B
> 0  1  a
> 1  2  b
> 2  3  c
> 3  1  a
> 4  2  b
> 5  3  c
> {code}
>  





[jira] [Updated] (ARROW-13735) [Python] Creating a Map array with non-default field names segfaults

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13735:
--
Labels: good-second-issue  (was: )

> [Python] Creating a Map array with non-default field names segfaults
> 
>
> Key: ARROW-13735
> URL: https://issues.apache.org/jira/browse/ARROW-13735
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: good-second-issue
> Fix For: 6.0.0
>
>
> With ARROW-13696, you can create a MapType with non-default field names (the 
> default being "key" and "value"). 
> However, when then trying to create an array with it from python tuples, it 
> crashes:
> {code:python}
> >>> t = pa.map_(pa.field("name", "string", nullable=False), "int64")
> >>> pa.array([[('a', 1), ('b', 2)], [('c', 3)]], type=t)
> ../src/arrow/array/array_nested.cc:192:  Check failed: 
> self->list_type_->value_type()->Equals(data->child_data[0]->type) 
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b882)[0x7f298d497882]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b800)[0x7f298d497800]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b822)[0x7f298d497822]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow4util8ArrowLogD1Ev+0x47)[0x7f298d497b81]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xb39d31)[0x7f298d0c5d31]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow8MapArray7SetDataERKSt10shared_ptrINS_9ArrayDataEE+0x198)[0x7f298d0c06be]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow8MapArrayC1ERKSt10shared_ptrINS_9ArrayDataEE+0x64)[0x7f298d0bed14]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN9__gnu_cxx13new_allocatorIN5arrow8MapArrayEE9constructIS2_JRKSt10shared_ptrINS1_9ArrayDataEvPT_DpOT0_+0x49)[0x7f298d1a0f13]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt16allocator_traitsISaIN5arrow8MapArrayEEE9constructIS1_JRKSt10shared_ptrINS0_9ArrayDataEvRS2_PT_DpOT0_+0x38)[0x7f298d19ebe6]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt23_Sp_counted_ptr_inplaceIN5arrow8MapArrayESaIS1_ELN9__gnu_cxx12_Lock_policyE2EEC1IJRKSt10shared_ptrINS0_9ArrayDataES2_DpOT_+0xaf)[0x7f298d19b547]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2IN5arrow8MapArrayESaIS5_EJRKSt10shared_ptrINS4_9ArrayDataERPT_St20_Sp_alloc_shared_tagIT0_EDpOT1_+0xb2)[0x7f298d195a64]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt12__shared_ptrIN5arrow8MapArrayELN9__gnu_cxx12_Lock_policyE2EEC2ISaIS1_EJRKSt10shared_ptrINS0_9ArrayDataESt20_Sp_alloc_shared_tagIT_EDpOT0_+0x4c)[0x7f298d1918bc]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt10shared_ptrIN5arrow8MapArrayEEC1ISaIS1_EJRKS_INS0_9ArrayDataESt20_Sp_alloc_shared_tagIT_EDpOT0_+0x39)[0x7f298d18f617]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZSt15allocate_sharedIN5arrow8MapArrayESaIS1_EJRKSt10shared_ptrINS0_9ArrayDataS3_IT_ERKT0_DpOT1_+0x38)[0x7f298d18d254]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZSt11make_sharedIN5arrow8MapArrayEJRKSt10shared_ptrINS0_9ArrayDataS2_IT_EDpOT0_+0x54)[0x7f298d1897b7]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xbf5d6a)[0x7f298d181d6a]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xbef0f3)[0x7f298d17b0f3]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x99)[0x7f298d173f6b]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow12ArrayBuilder6FinishEPSt10shared_ptrINS_5ArrayEE+0x115)[0x7f298d0e4ed9]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow12ArrayBuilder6FinishEv+0x47)[0x7f298d0e4fb7]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x28cc91)[0x7f29d05d2c91]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x292774)[0x7f29d05d8774]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x28ca00)[0x7f29d05d2a00]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x288f63)[0x7f29d05cef63]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(_ZN5arrow2py17ConvertPySequenceEP7_objectS2_NS0_19PyConversionOptionsEPNS_10MemoryPoolE+0xa9d)[0x7f29d05cadb7]
> /home/joris/scipy/repos/arrow/python/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c890d)[0x7f29d08f190d]
> /home/joris/miniconda3/envs/arrow-dev/bin/python(PyCFunction_Call+0x54)[0x5581d331a814]
> /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyObject_MakeTpCall+0x31e)[0x5581d332988e]
> 

[jira] [Updated] (ARROW-13735) [Python] Creating a Map array with non-default field names segfaults

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13735:
--
Labels:   (was: good-second-issue)

> [Python] Creating a Map array with non-default field names segfaults
> 
>
> Key: ARROW-13735
> URL: https://issues.apache.org/jira/browse/ARROW-13735
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 6.0.0
>
>
> With ARROW-13696, you can create a MapType with non-default field names (the 
> default being "key" and "value"). 
> However, when then trying to create an array with it from python tuples, it 
> crashes:
> {code:python}
> >>> t = pa.map_(pa.field("name", "string", nullable=False), "int64")
> >>> pa.array([[('a', 1), ('b', 2)], [('c', 3)]], type=t)
> ../src/arrow/array/array_nested.cc:192:  Check failed: 
> self->list_type_->value_type()->Equals(data->child_data[0]->type) 
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b882)[0x7f298d497882]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b800)[0x7f298d497800]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b822)[0x7f298d497822]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow4util8ArrowLogD1Ev+0x47)[0x7f298d497b81]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xb39d31)[0x7f298d0c5d31]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow8MapArray7SetDataERKSt10shared_ptrINS_9ArrayDataEE+0x198)[0x7f298d0c06be]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow8MapArrayC1ERKSt10shared_ptrINS_9ArrayDataEE+0x64)[0x7f298d0bed14]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN9__gnu_cxx13new_allocatorIN5arrow8MapArrayEE9constructIS2_JRKSt10shared_ptrINS1_9ArrayDataEvPT_DpOT0_+0x49)[0x7f298d1a0f13]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt16allocator_traitsISaIN5arrow8MapArrayEEE9constructIS1_JRKSt10shared_ptrINS0_9ArrayDataEvRS2_PT_DpOT0_+0x38)[0x7f298d19ebe6]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt23_Sp_counted_ptr_inplaceIN5arrow8MapArrayESaIS1_ELN9__gnu_cxx12_Lock_policyE2EEC1IJRKSt10shared_ptrINS0_9ArrayDataES2_DpOT_+0xaf)[0x7f298d19b547]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2IN5arrow8MapArrayESaIS5_EJRKSt10shared_ptrINS4_9ArrayDataERPT_St20_Sp_alloc_shared_tagIT0_EDpOT1_+0xb2)[0x7f298d195a64]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt12__shared_ptrIN5arrow8MapArrayELN9__gnu_cxx12_Lock_policyE2EEC2ISaIS1_EJRKSt10shared_ptrINS0_9ArrayDataESt20_Sp_alloc_shared_tagIT_EDpOT0_+0x4c)[0x7f298d1918bc]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt10shared_ptrIN5arrow8MapArrayEEC1ISaIS1_EJRKS_INS0_9ArrayDataESt20_Sp_alloc_shared_tagIT_EDpOT0_+0x39)[0x7f298d18f617]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZSt15allocate_sharedIN5arrow8MapArrayESaIS1_EJRKSt10shared_ptrINS0_9ArrayDataS3_IT_ERKT0_DpOT1_+0x38)[0x7f298d18d254]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZSt11make_sharedIN5arrow8MapArrayEJRKSt10shared_ptrINS0_9ArrayDataS2_IT_EDpOT0_+0x54)[0x7f298d1897b7]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xbf5d6a)[0x7f298d181d6a]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xbef0f3)[0x7f298d17b0f3]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x99)[0x7f298d173f6b]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow12ArrayBuilder6FinishEPSt10shared_ptrINS_5ArrayEE+0x115)[0x7f298d0e4ed9]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow12ArrayBuilder6FinishEv+0x47)[0x7f298d0e4fb7]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x28cc91)[0x7f29d05d2c91]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x292774)[0x7f29d05d8774]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x28ca00)[0x7f29d05d2a00]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x288f63)[0x7f29d05cef63]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(_ZN5arrow2py17ConvertPySequenceEP7_objectS2_NS0_19PyConversionOptionsEPNS_10MemoryPoolE+0xa9d)[0x7f29d05cadb7]
> /home/joris/scipy/repos/arrow/python/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c890d)[0x7f29d08f190d]
> /home/joris/miniconda3/envs/arrow-dev/bin/python(PyCFunction_Call+0x54)[0x5581d331a814]
> /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyObject_MakeTpCall+0x31e)[0x5581d332988e]
> 

[jira] [Commented] (ARROW-14260) [C++] GTest linker error with vcpkg and Visual Studio 2019

2021-10-11 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427008#comment-17427008
 ] 

Antoine Pitrou commented on ARROW-14260:


The CI job installs GTest using vcpkg, but CMake is also building GTest from 
source during the Arrow build. There's probably a version mismatch between the 
two.

> [C++] GTest linker error with vcpkg and Visual Studio 2019
> --
>
> Key: ARROW-14260
> URL: https://issues.apache.org/jira/browse/ARROW-14260
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ian Cook
>Priority: Major
>
> The *test-build-vcpkg-win* nightly Crossbow job is failing with these linker 
> errors:
> {code:java}
>  unity_1_cxx.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) void __cdecl 
> testing::internal2::PrintBytesInObjectTo(unsigned char const *,unsigned 
> __int64,class std::basic_ostream<char,struct std::char_traits<char> > *)" 
> (__imp_?PrintBytesInObjectTo@internal2@testing@@YAXPEBE_KPEAV?$basic_ostream@DU?$char_traits@D@std@@@std@@@Z)
>  referenced in function "class std::basic_ostream<char,struct 
> std::char_traits<char> > & __cdecl testing::internal2::operator<<<char,struct 
> std::char_traits<char>,class std::_Vector_iterator<class std::_Vector_val<struct 
> std::_Simple_types<class arrow::compute::ExecNode *> > > >(class 
> std::basic_ostream<char,struct std::char_traits<char> > &,class 
> std::_Vector_iterator<class std::_Vector_val<struct std::_Simple_types<class 
> arrow::compute::ExecNode *> > > const &)" 
> (??$?6DU?$char_traits@D@std@@V?$_Vector_iterator@V?$_Vector_val@U?$_Simple_types@PEAVExecNode@compute@arrow@@@std@@@std@@@1@@internal2@testing@@YAAEAV?$basic_ostream@DU?$char_traits@D@std@@@std@@AEAV23@AEBV?$_Vector_iterator@V?$_Vector_val@U?$_Simple_types@PEAVExecNode@compute@arrow@@@std@@@std@@@3@@Z)
>  
> unity_1_cxx.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) class testing::AssertionResult __cdecl 
> testing::internal::CmpHelperEQ(char const *,char const *,__int64,__int64)" 
> (__imp_?CmpHelperEQ@internal@testing@@YA?AVAssertionResult@2@PEBD0_J1@Z) 
> referenced in function "void __cdecl arrow::fs::AssertFileContents(class 
> arrow::fs::FileSystem *,class std::basic_string<char,struct 
> std::char_traits<char>,class std::allocator<char> > const &,class 
> std::basic_string<char,struct std::char_traits<char>,class 
> std::allocator<char> > const &)" 
> (?AssertFileContents@fs@arrow@@YAXPEAVFileSystem@12@AEBV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@1@Z)
>  
> unity_0_cxx.obj : error LNK2001: unresolved external symbol 
> "__declspec(dllimport) class testing::AssertionResult __cdecl 
> testing::internal::CmpHelperEQ(char const *,char const *,__int64,__int64)" 
> (__imp_?CmpHelperEQ@internal@testing@@YA?AVAssertionResult@2@PEBD0_J1@Z) 
> {code}
> Link to the error where it occurs in the full log: 
> https://github.com/ursacomputing/crossbow/runs/3799925986#step:4:2737



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14260) [C++] GTest linker error with vcpkg and Visual Studio 2019

2021-10-11 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427007#comment-17427007
 ] 

Antoine Pitrou commented on ARROW-14260:


I would say it looks more like a GTest linking or packaging issue to me. cc 
[~kou] [~kszucs]




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14260) [C++] GTest linker error with vcpkg and Visual Studio 2019

2021-10-11 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-14260:
---
Summary: [C++] GTest linker error with vcpkg and Visual Studio 2019  (was: 
[C++] Linker error with Visual Studio 2019)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11238) [Python] Make SubTreeFileSystem print method more informative

2021-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11238:
--
Labels: good-first-issue  (was: )

> [Python] Make SubTreeFileSystem print method more informative
> -
>
> Key: ARROW-11238
> URL: https://issues.apache.org/jira/browse/ARROW-11238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Ian Cook
>Priority: Minor
>  Labels: good-first-issue
>
> The {{SubTreeFileSystem}} class does not have a {{\_\_str\_\_}} or 
> {{\_\_repr\_\_}} method. Define these methods to show useful information when 
> these objects are printed, such as a filesystem URI including scheme.
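A minimal sketch of what such a method could look like. This is plain Python with toy stand-in classes for illustration only (`SubTreeFileSystem` and `LocalFileSystem` here are hypothetical mocks, not pyarrow's Cython implementations, and the exact output format is an assumption):

```python
class LocalFileSystem:
    """Toy stand-in for pyarrow.fs.LocalFileSystem."""

    def __repr__(self):
        return "LocalFileSystem()"


class SubTreeFileSystem:
    """Toy stand-in for pyarrow.fs.SubTreeFileSystem."""

    def __init__(self, base_path, base_fs):
        self.base_path = base_path
        self.base_fs = base_fs

    def __repr__(self):
        # Show the subtree prefix and the wrapped filesystem instead of
        # the uninformative default <object at 0x...> form.
        return (f"{type(self).__name__}(base_path={self.base_path!r}, "
                f"base_fs={self.base_fs!r})")


fs = SubTreeFileSystem("/data/warehouse", LocalFileSystem())
print(repr(fs))
# SubTreeFileSystem(base_path='/data/warehouse', base_fs=LocalFileSystem())
```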



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14213) R arrow package not working on RStudio/Ubuntu

2021-10-11 Thread Thomas Wutzler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17426995#comment-17426995
 ] 

Thomas Wutzler commented on ARROW-14213:


The admins responded that these packages are installed with the following 
versions:

ii libcurl3:amd64 7.47.0-1ubuntu2.19 amd64 easy-to-use client-side URL transfer 
library (OpenSSL flavour)
ii libcurl3-gnutls:amd64 7.47.0-1ubuntu2.19 amd64 easy-to-use client-side URL 
transfer library (GnuTLS flavour)
ii libcurl4-gnutls-dev:amd64 7.47.0-1ubuntu2.19 amd64 development files and 
documentation for libcurl (GnuTLS flavour)
ii openssl 1.0.2g-1ubuntu4.20 amd64 Secure Sockets Layer toolkit - 
cryptographic utility

> R arrow package not working on RStudio/Ubuntu
> -
>
> Key: ARROW-14213
> URL: https://issues.apache.org/jira/browse/ARROW-14213
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
> Copyright (C) 2020 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
>Reporter: Thomas Wutzler
>Priority: Major
>
> I am trying to read Feather files, generated in Python, in R with the arrow 
> package.
> I run R 3.6.3 in an RStudio Server session on a Linux machine to which I 
> have no other access. I get the message:
>  {{Cannot call io___MemoryMappedFile__Open().}}
> Following the advice in the linked help file 
> [https://cran.r-project.org/web/packages/arrow/vignettes/install.html], I am 
> creating this issue with the full log of the installation:
> > arrow::install_arrow(verbose = TRUE)Installing package into 
> > '/Net/Groups/BGI/scratch/twutz/R/atacama-library/3.6'
> (as 'lib' is unspecified)trying URL 
> 'https://ftp5.gwdg.de/pub/misc/cran/src/contrib/arrow_5.0.0.2.tar.gz'Content 
> type 'application/octet-stream' length 483642 bytes (472 
> KB)==downloaded 472 KB* 
> installing *source* package 'arrow' ...** package 'arrow' successfully 
> unpacked and MD5 sums checked** using staged installationtrying URL 
> 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/ubuntu-16.04/arrow-5.0.0.2.zip'Content
>  type 'binary/octet-stream' length 17214781 bytes (16.4 
> MB)==downloaded 16.4 MB*** 
> Successfully retrieved C++ binaries for ubuntu-16.04
>  Binary package requires libcurl and openssl
>  If installation fails, retry after installing those system requirements
> PKG_CFLAGS=-I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include
>   -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3
> PKG_LIBS=-L/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/lib
>  -larrow_dataset -lparquet -larrow -larrow -larrow_bundled_dependencies 
> -larrow_dataset -lparquet -lssl -lcrypto -lcurl** libsg++ -std=gnu++11 
> -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c RTasks.cpp -o RTasks.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c altrep.cpp -o altrep.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c array.cpp -o array.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c array_to_vector.cpp -o array_to_vector.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security 

[jira] [Updated] (ARROW-14273) PlasmaClient::Contains should return false before the corresponding object is sealed

2021-10-11 Thread chimucong (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chimucong updated ARROW-14273:
--
Component/s: C++

> PlasmaClient::Contains should return false before the corresponding object is 
> sealed
> 
>
> Key: ARROW-14273
> URL: https://issues.apache.org/jira/browse/ARROW-14273
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: chimucong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14273) PlasmaClient::Contains should return false before the corresponding object is sealed

2021-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14273:
---
Labels: pull-request-available  (was: )

> PlasmaClient::Contains should return false before the corresponding object is 
> sealed
> 
>
> Key: ARROW-14273
> URL: https://issues.apache.org/jira/browse/ARROW-14273
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: chimucong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14273) PlasmaClient::Contains should return false before the corresponding object is sealed

2021-10-11 Thread chimucong (Jira)
chimucong created ARROW-14273:
-

 Summary: PlasmaClient::Contains should return false before the 
corresponding object is sealed
 Key: ARROW-14273
 URL: https://issues.apache.org/jira/browse/ARROW-14273
 Project: Apache Arrow
  Issue Type: Bug
Reporter: chimucong






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14272) PlasmaClient::Contains should return false before the corresponding object is sealed

2021-10-11 Thread chimucong (Jira)
chimucong created ARROW-14272:
-

 Summary: PlasmaClient::Contains should return false before the 
corresponding object is sealed
 Key: ARROW-14272
 URL: https://issues.apache.org/jira/browse/ARROW-14272
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Plasma
Reporter: chimucong


According to the 
doc([https://arrow.apache.org/docs/python/generated/pyarrow.plasma.PlasmaClient.html?highlight=contains#pyarrow.plasma.PlasmaClient.contains]),
 contains should return false before the corresponding object is sealed.
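The documented contract can be illustrated with a small in-memory mock (this `MockPlasmaStore` is a hypothetical stand-in written for this sketch, not the actual C++ `PlasmaClient`): an object created but not yet sealed should be invisible to {{contains}}.

```python
class MockPlasmaStore:
    """In-memory stand-in illustrating the documented Plasma contract:
    contains() must report False for an object that has been created
    but not yet sealed."""

    def __init__(self):
        self._created = {}   # mutable, writer-only buffers
        self._sealed = {}    # immutable, reader-visible objects

    def create(self, object_id, size):
        buf = bytearray(size)
        self._created[object_id] = buf
        return buf

    def seal(self, object_id):
        # Sealing makes the object immutable and visible to readers.
        self._sealed[object_id] = bytes(self._created.pop(object_id))

    def contains(self, object_id):
        # Only sealed objects count as "contained".
        return object_id in self._sealed


store = MockPlasmaStore()
store.create(b"id-1", 8)
print(store.contains(b"id-1"))   # False: created but not yet sealed
store.seal(b"id-1")
print(store.contains(b"id-1"))   # True
```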



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14271) [Java] Inconsistent logic for type IDs in Union vectors

2021-10-11 Thread Roee Shlomo (Jira)
Roee Shlomo created ARROW-14271:
---

 Summary: [Java] Inconsistent logic for type IDs in Union vectors
 Key: ARROW-14271
 URL: https://issues.apache.org/jira/browse/ARROW-14271
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 6.0.0
Reporter: Roee Shlomo


The current logic for calculating the type IDs in UnionVector#getField and 
DenseUnionVector#getField is:
 # DenseUnionVector uses an increasing counter 
 # UnionVector uses the ordinal of the type enum
 # Both completely ignore the type IDs provided at construction as part of 
fieldType (if provided)

We encountered this inconsistency while testing a direct roundtrip of a union 
vector between pyarrow and Java with the C Data Interface ('direct' here means 
without using VectorSchemaRoot/RecordBatch). The identifiers for the type IDs 
differ after completing a roundtrip. 
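The roundtrip failure mode can be sketched in a few lines of plain Python (a hypothetical illustration, not the actual Java code; the field names and IDs are made up): if `getField` re-derives dense-union type IDs from an increasing counter rather than honoring the IDs supplied at construction, any producer-chosen IDs are lost.

```python
def rederive_type_ids(fields):
    # Mimics what the reported DenseUnionVector#getField logic effectively
    # does: assign type IDs from an increasing counter, ignoring any IDs
    # supplied at construction time.
    return list(range(len(fields)))


fields = ["int_child", "str_child"]
provided_type_ids = [5, 9]            # IDs chosen by the producer (e.g. pyarrow)

roundtripped_ids = rederive_type_ids(fields)
print(roundtripped_ids)               # [0, 1] -- the provided [5, 9] were dropped
print(provided_type_ids == roundtripped_ids)  # False: schemas no longer match
```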

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)