[jira] [Commented] (ARROW-14982) [C++][Python] Create utils for deep-copying an Array/ ArrayData

2021-12-03 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453258#comment-17453258
 ] 

David Li commented on ARROW-14982:
--

We could perhaps wrap Concatenate in a kernel and name it deep_copy. (I think 
Concatenate on a single array suffices to copy the data.)
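
To make the requested semantics concrete (copy only the window [offset, offset + length) of an array, with length = -1 meaning "through the end", as in the signature quoted below), here is a minimal pure-Python model. ToyArrayData and deep_copy_array_data are hypothetical names used only for illustration, not Arrow APIs, and the model assumes one byte per element; a real implementation would also have to handle validity bitmaps, offsets buffers, and child data.

```python
from dataclasses import dataclass


@dataclass
class ToyArrayData:
    """Hypothetical stand-in for a sliced array: a view into a shared buffer."""
    buffer: bytearray  # may be shared with other arrays
    offset: int        # first element of this array within the buffer
    length: int        # number of elements (one byte each in this toy model)

    def to_list(self):
        return list(self.buffer[self.offset:self.offset + self.length])


def deep_copy_array_data(arr, offset=0, length=-1):
    """Copy elements [offset, offset + length) of `arr` into a fresh buffer.

    length == -1 means "everything from `offset` to the end of `arr`".
    """
    if length == -1:
        length = arr.length - offset
    if offset < 0 or length < 0 or offset + length > arr.length:
        raise IndexError("slice out of bounds")
    start = arr.offset + offset
    # Slicing a bytearray allocates new storage, so the result shares nothing.
    return ToyArrayData(arr.buffer[start:start + length], 0, length)


shared = bytearray(range(10))
view = ToyArrayData(shared, 2, 5)            # elements 2..6 of the shared buffer
copy = deep_copy_array_data(view, offset=1)  # elements 3..6, in a new buffer
```

In this model the copy owns a 4-byte buffer rather than keeping the 10-byte shared buffer alive, which is presumably one motivation for deep-copying a small slice.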

> [C++][Python] Create utils for deep-copying an Array/ ArrayData
> ---
>
> Key: ARROW-14982
> URL: https://issues.apache.org/jira/browse/ARROW-14982
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Niranda Perera
>Priority: Major
>
> Hi, I would like to request a util to deep copy an Array with the following 
> semantic. 
>  
> {code:java}
> Result<std::shared_ptr<ArrayData>> DeepCopyArrayData(const ArrayData& arr, int64
> offset = 0, int64 length = -1/*, pool=...*/);
> {code}
>  
> This was discussed some time back in Zulip [1]. 
>  
> [1] 
> [https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/create.20a.20deep.20copy.20of.20buffers]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-13672) [C++] BinaryBuilder doesn't preserve passed in DataType

2021-12-03 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453257#comment-17453257
 ] 

David Li commented on ARROW-13672:
--

[~supun] probably the builder should store the given type and use it in 
[FinishInternal|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L317]
 (and presumably in 
[type()|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L390]).
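
A rough sketch of that suggestion, in Python rather than C++ for brevity: the builder keeps the type it was constructed with and reports it from both type() and the finish step. ToyBinaryBuilder and its dict-based result are hypothetical stand-ins, not Arrow's classes.

```python
class ToyBinaryBuilder:
    """Hypothetical builder that remembers the type passed at construction."""

    DEFAULT_TYPE = "binary"

    def __init__(self, data_type=None):
        # Store the caller's type instead of discarding it.
        self._type = data_type if data_type is not None else self.DEFAULT_TYPE
        self._values = []

    def type(self):
        # Report the stored type rather than a hard-coded default.
        return self._type

    def append(self, value):
        self._values.append(value)

    def finish(self):
        # The finished array carries the stored type, so an extension type survives.
        result = {"type": self._type, "values": list(self._values)}
        self._values.clear()
        return result


builder = ToyBinaryBuilder("my_extension<binary>")
builder.append(b"ab")
array = builder.finish()
```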

> [C++] BinaryBuilder doesn't preserve passed in DataType
> ---
>
> Key: ARROW-13672
> URL: https://issues.apache.org/jira/browse/ARROW-13672
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: Micah Kornfield
>Assignee: Supun Kamburugamuva
>Priority: Minor
>  Labels: beginner, good-first-issue
>
> There is a 
> [constructor|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L56]
>  that takes a datatype for binary builder but it is discarded.  When 
> constructing an Array the type is always the value returned from type() 
> [binary|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L390]
> If a consumer of the API wants to have an extension array, this prevents them 
> from passing the extension type through.





[jira] [Updated] (ARROW-14982) [C++][Python] Create utils for deep-copying an Array/ ArrayData

2021-12-03 Thread Niranda Perera (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niranda Perera updated ARROW-14982:
---
Description: 
Hi, I would like to request a util to deep copy an Array with the following 
semantic. 

 
{code:java}
Result<std::shared_ptr<ArrayData>> DeepCopyArrayData(const ArrayData& arr, int64
offset = 0, int64 length = -1/*, pool=...*/);
{code}
 

This was discussed some time back in Zulip [1]. 

 

[1] 
[https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/create.20a.20deep.20copy.20of.20buffers]

  was:
Hi, I would like to request a util to deep copy an Array with the following 
semantic. 

 
{code:java}
Result<std::shared_ptr<ArrayData>> DeepCopyArrayData(const ArrayData& arr, int64
offset = 0, int64 length = arr->length()/*, pool=...*/);
{code}
 

This was discussed some time back in Zulip [1]. 

 

[1] 
https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/create.20a.20deep.20copy.20of.20buffers


> [C++][Python] Create utils for deep-copying an Array/ ArrayData
> ---
>
> Key: ARROW-14982
> URL: https://issues.apache.org/jira/browse/ARROW-14982
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Niranda Perera
>Priority: Major
>
> Hi, I would like to request a util to deep copy an Array with the following 
> semantic. 
>  
> {code:java}
> Result<std::shared_ptr<ArrayData>> DeepCopyArrayData(const ArrayData& arr, int64
> offset = 0, int64 length = -1/*, pool=...*/);
> {code}
>  
> This was discussed some time back in Zulip [1]. 
>  
> [1] 
> [https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/create.20a.20deep.20copy.20of.20buffers]





[jira] [Created] (ARROW-14982) [C++][Python] Create utils for deep-copying an Array/ ArrayData

2021-12-03 Thread Niranda Perera (Jira)
Niranda Perera created ARROW-14982:
--

 Summary: [C++][Python] Create utils for deep-copying an Array/ 
ArrayData
 Key: ARROW-14982
 URL: https://issues.apache.org/jira/browse/ARROW-14982
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: Niranda Perera


Hi, I would like to request a util to deep copy an Array with the following 
semantic. 

 
{code:java}
Result<std::shared_ptr<ArrayData>> DeepCopyArrayData(const ArrayData& arr, int64
offset = 0, int64 length = arr->length()/*, pool=...*/);
{code}
 

This was discussed some time back in Zulip [1]. 

 

[1] 
https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/create.20a.20deep.20copy.20of.20buffers





[jira] [Assigned] (ARROW-13672) [C++] BinaryBuilder doesn't preserve passed in DataType

2021-12-03 Thread Supun Kamburugamuva (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Supun Kamburugamuva reassigned ARROW-13672:
---

Assignee: Supun Kamburugamuva

> [C++] BinaryBuilder doesn't preserve passed in DataType
> ---
>
> Key: ARROW-13672
> URL: https://issues.apache.org/jira/browse/ARROW-13672
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: Micah Kornfield
>Assignee: Supun Kamburugamuva
>Priority: Minor
>  Labels: beginner, good-first-issue
>
> There is a 
> [constructor|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L56]
>  that takes a datatype for binary builder but it is discarded.  When 
> constructing an Array the type is always the value returned from type() 
> [binary|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L390]
> If a consumer of the API wants to have an extension array, this prevents them 
> from passing the extension type through.





[jira] [Commented] (ARROW-13672) [C++] BinaryBuilder doesn't preserve passed in DataType

2021-12-03 Thread Supun Kamburugamuva (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453217#comment-17453217
 ] 

Supun Kamburugamuva commented on ARROW-13672:
-

Should the solution be that we remove passing the type to the constructor?

> [C++] BinaryBuilder doesn't preserve passed in DataType
> ---
>
> Key: ARROW-13672
> URL: https://issues.apache.org/jira/browse/ARROW-13672
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: Micah Kornfield
>Priority: Minor
>  Labels: beginner, good-first-issue
>
> There is a 
> [constructor|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L56]
>  that takes a datatype for binary builder but it is discarded.  When 
> constructing an Array the type is always the value returned from type() 
> [binary|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L390]
> If a consumer of the API wants to have an extension array, this prevents them 
> from passing the extension type through.





[jira] [Updated] (ARROW-14981) [CI][Docs] Upload built documents

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14981:
---
Labels: pull-request-available  (was: )

> [CI][Docs] Upload built documents
> -
>
> Key: ARROW-14981
> URL: https://issues.apache.org/jira/browse/ARROW-14981
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Documentation
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We can use this in release process instead of building on release manager's 
> local environment.





[jira] [Created] (ARROW-14981) [CI][Docs] Upload built documents

2021-12-03 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-14981:


 Summary: [CI][Docs] Upload built documents
 Key: ARROW-14981
 URL: https://issues.apache.org/jira/browse/ARROW-14981
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Documentation
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


We can use this in release process instead of building on release manager's 
local environment.





[jira] [Updated] (ARROW-13735) [Python] Creating a Map array with non-default field names segfaults

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13735:
---
Labels: pull-request-available  (was: )

> [Python] Creating a Map array with non-default field names segfaults
> 
>
> Key: ARROW-13735
> URL: https://issues.apache.org/jira/browse/ARROW-13735
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With ARROW-13696, you can create a MapType with non-default field names (the 
> default being "key" and "value"). 
> However, when then trying to create an array with it from python tuples, it 
> crashes:
> {code:python}
> >>> t = pa.map_(pa.field("name", "string", nullable=False), "int64")
> >>> pa.array([[('a', 1), ('b', 2)], [('c', 3)]], type=t)
> ../src/arrow/array/array_nested.cc:192:  Check failed: 
> self->list_type_->value_type()->Equals(data->child_data[0]->type) 
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b882)[0x7f298d497882]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b800)[0x7f298d497800]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b822)[0x7f298d497822]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow4util8ArrowLogD1Ev+0x47)[0x7f298d497b81]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xb39d31)[0x7f298d0c5d31]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow8MapArray7SetDataERKSt10shared_ptrINS_9ArrayDataEE+0x198)[0x7f298d0c06be]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow8MapArrayC1ERKSt10shared_ptrINS_9ArrayDataEE+0x64)[0x7f298d0bed14]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN9__gnu_cxx13new_allocatorIN5arrow8MapArrayEE9constructIS2_JRKSt10shared_ptrINS1_9ArrayDataEvPT_DpOT0_+0x49)[0x7f298d1a0f13]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt16allocator_traitsISaIN5arrow8MapArrayEEE9constructIS1_JRKSt10shared_ptrINS0_9ArrayDataEvRS2_PT_DpOT0_+0x38)[0x7f298d19ebe6]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt23_Sp_counted_ptr_inplaceIN5arrow8MapArrayESaIS1_ELN9__gnu_cxx12_Lock_policyE2EEC1IJRKSt10shared_ptrINS0_9ArrayDataES2_DpOT_+0xaf)[0x7f298d19b547]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2IN5arrow8MapArrayESaIS5_EJRKSt10shared_ptrINS4_9ArrayDataERPT_St20_Sp_alloc_shared_tagIT0_EDpOT1_+0xb2)[0x7f298d195a64]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt12__shared_ptrIN5arrow8MapArrayELN9__gnu_cxx12_Lock_policyE2EEC2ISaIS1_EJRKSt10shared_ptrINS0_9ArrayDataESt20_Sp_alloc_shared_tagIT_EDpOT0_+0x4c)[0x7f298d1918bc]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt10shared_ptrIN5arrow8MapArrayEEC1ISaIS1_EJRKS_INS0_9ArrayDataESt20_Sp_alloc_shared_tagIT_EDpOT0_+0x39)[0x7f298d18f617]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZSt15allocate_sharedIN5arrow8MapArrayESaIS1_EJRKSt10shared_ptrINS0_9ArrayDataS3_IT_ERKT0_DpOT1_+0x38)[0x7f298d18d254]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZSt11make_sharedIN5arrow8MapArrayEJRKSt10shared_ptrINS0_9ArrayDataS2_IT_EDpOT0_+0x54)[0x7f298d1897b7]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xbf5d6a)[0x7f298d181d6a]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xbef0f3)[0x7f298d17b0f3]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x99)[0x7f298d173f6b]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow12ArrayBuilder6FinishEPSt10shared_ptrINS_5ArrayEE+0x115)[0x7f298d0e4ed9]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow12ArrayBuilder6FinishEv+0x47)[0x7f298d0e4fb7]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x28cc91)[0x7f29d05d2c91]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x292774)[0x7f29d05d8774]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x28ca00)[0x7f29d05d2a00]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x288f63)[0x7f29d05cef63]
> /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(_ZN5arrow2py17ConvertPySequenceEP7_objectS2_NS0_19PyConversionOptionsEPNS_10MemoryPoolE+0xa9d)[0x7f29d05cadb7]
> /home/joris/scipy/repos/arrow/python/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c890d)[0x7f29d08f190d]
> /home/joris/miniconda3/envs/arrow-dev/bin/python(PyCFunction_Call+0x54)[0x5581d331a814]
> /home/joris/miniconda

[jira] [Updated] (ARROW-12629) [C++] Configurable read-ahead in CSV and JSON readers

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12629:
---
Labels: good-first-issue pull-request-available  (was: good-first-issue)

> [C++] Configurable read-ahead in CSV and JSON readers
> -
>
> Key: ARROW-12629
> URL: https://issues.apache.org/jira/browse/ARROW-12629
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Andre Kohn
>Assignee: Supun Kamburugamuva
>Priority: Major
>  Labels: good-first-issue, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We are compiling Arrow C++ to WebAssembly and ran into the following issue 
> with the CSV reader:
> Browsers became very picky about the use of SharedArrayBuffers after the 
> events around Spectre and Meltdown.
> As a result, you have to compile Arrow to WebAssembly without threads if you 
> don't want to run your website with very strict cross-origin isolation.
> Unfortunately, the CSV reader seems to always spawn a thread for the 
> read-ahead in both the SerialStreamingReader and the SerialTableReader, 
> regardless of whether use_threads is set.
> Right now, this effectively means that you cannot use the CSV (and JSON) 
> readers in threadless WebAssembly builds.
>  
> [https://github.com/apache/arrow/blob/4363fefe46dc357a9013f0f4bcdc235e1e2e8124/cpp/src/arrow/csv/reader.cc#L839]
> [https://github.com/apache/arrow/blob/4363fefe46dc357a9013f0f4bcdc235e1e2e8124/cpp/src/arrow/csv/reader.cc#L913]
>  
>  
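
One way to picture a configurable read-ahead, as a toy Python sketch rather than Arrow's C++ reader: a block iterator that only creates a background thread when read-ahead is requested and otherwise reads serially on the caller's thread. read_blocks and its read_ahead parameter are hypothetical names for illustration, and propagating producer errors to the consumer is left out.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def read_blocks(blocks, read_ahead=0):
    """Yield items from `blocks`, optionally prefetching on a background thread.

    read_ahead <= 0 reads serially on the caller's thread (no thread is created).
    """
    if read_ahead <= 0:
        yield from blocks
        return

    pending = queue.Queue(maxsize=read_ahead)  # bounds how far ahead we read

    def producer():
        try:
            for block in blocks:
                pending.put(block)
        finally:
            pending.put(_DONE)  # always unblock the consumer

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        block = pending.get()
        if block is _DONE:
            break
        yield block
    worker.join()


data = [b"a,b\n", b"1,2\n", b"3,4\n"]
serial = list(read_blocks(iter(data), read_ahead=0))
threaded = list(read_blocks(iter(data), read_ahead=2))
```

With read_ahead=0 this path never touches the threading module, which is the property a threadless WebAssembly build would need.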





[jira] [Resolved] (ARROW-14879) [Python][Packaging] Remove manylinux2010 wheels

2021-12-03 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook resolved ARROW-14879.
--
Fix Version/s: (was: 7.0.0)
   Resolution: Won't Fix

Closing as "won't fix" for now.

Quoting [~kszucs]:
{quote}I managed to [patch[elf] the vcpkg binary with glibc 
2.18|https://github.com/apache/arrow/pull/11569/files#diff-6f627de1c3985ba6a76addecbcb7b3cb88ca0ca6244d91339a348cf4d48db914R52-R53]
 in the manylinux2010 image, so we should be able to maintain the builds until 
the vcpkg dependencies compile.
 
Closing in favor of 
[{{2233ac5}}|https://github.com/apache/arrow/commit/2233ac5782e52015d1f51ac3f7dd201c1262a947]
{quote}
 

> [Python][Packaging] Remove manylinux2010 wheels
> ---
>
> Key: ARROW-14879
> URL: https://issues.apache.org/jira/browse/ARROW-14879
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> More recent vcpkg is not compatible with older glibc shipped by manylinux2010 
> so we won't be able to regularly update the dependencies. Besides that 
> manylinux2010 has reached EOL.





[jira] [Commented] (ARROW-13035) [C++] Create a compute function returning indices of non-zero values

2021-12-03 Thread Niranda Perera (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453190#comment-17453190
 ] 

Niranda Perera commented on ARROW-13035:


[~amol-] Shall I take it up? I think it is fairly straightforward.

> [C++] Create a compute function returning indices of non-zero values
> 
>
> Key: ARROW-13035
> URL: https://issues.apache.org/jira/browse/ARROW-13035
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: good-first-issue
>
> This would be similar to Numpy's {{nonzero}} function:
> [https://numpy.org/doc/stable/reference/generated/numpy.nonzero.html]
> {code:python}
> >>> arr = np.array([4,5,0,6,0,5])
> >>> np.nonzero(arr)
> (array([0, 1, 3, 5]),)
> {code}
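
For reference, the behaviour being asked for can be sketched in a few lines of plain Python. nonzero_indices is a hypothetical name, and treating nulls (None) as "not non-zero" is an assumption made here, not something the issue specifies.

```python
def nonzero_indices(values):
    """Return the positions of elements that are neither zero nor null (None)."""
    return [i for i, v in enumerate(values) if v is not None and v != 0]


indices = nonzero_indices([4, 5, 0, 6, 0, 5])  # same input as the NumPy example
```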





[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access

2021-12-03 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453188#comment-17453188
 ] 

Joris Van den Bossche commented on ARROW-13168:
---

The same is the case for python: https://pypi.org/project/tzdata/

Ideally we would be able to specify the path at runtime, I think.
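
For comparison, this is roughly what runtime configuration looks like on the Python side with the standard-library zoneinfo module (PEP 615, Python 3.9+). The helper name use_tzdata_dir and the temporary directory are hypothetical; a real directory would contain compiled TZif files, and nothing here says how the C++ side should expose this.

```python
import tempfile
import zoneinfo  # standard library since Python 3.9 (PEP 615)


def use_tzdata_dir(path):
    """Point zoneinfo's search path at a single user-provided directory."""
    # reset_tzpath requires absolute paths and replaces the whole search path.
    zoneinfo.reset_tzpath(to=[path])
    # TZPATH must be read through the module so the updated value is seen.
    return zoneinfo.TZPATH


# Hypothetical location standing in for "the tzdata can be found here".
tzdir = tempfile.mkdtemp()
search_path = use_tzdata_dir(tzdir)
```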

> [C++] Timezone database configuration and access
> 
>
> Key: ARROW-13168
> URL: https://issues.apache.org/jira/browse/ARROW-13168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: timestamp
>
>  Note: the timezone database is currently not available on Windows, so 
> timezone-aware operations will fail.
> We're using tz.h library which needs an updated timezone database to 
> correctly handle timezoned timestamps. See [installation 
> instructions|https://howardhinnant.github.io/date/tz.html#Installation].
> We have the following options for getting a timezone database:
>  # local (non-windows) OS timezone database - no work required.
>  # arrow bundled folder - we could bundle the database at build time for 
> windows. Database would slowly go stale.
>  # download it from IANA Time Zone Database at runtime - tz.h gets the 
> database at runtime, but curl (and 7-zip on windows) are required.
>  # local user-provided folder - user could provide a location at buildtime. 
> Nice to have.
>  # allow runtime configuration - at runtime say: "the tzdata can be found at 
> this location"
> For more context see: 
> [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP 
> 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data].





[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access

2021-12-03 Thread Dewey Dunnington (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453186#comment-17453186
 ] 

Dewey Dunnington commented on ARROW-13168:
--

We can force it to be present at build-time, force it to be present at install 
time or force it to be present at runtime (probably what we'd do).




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-14961) [C++] Bump version on Google Benchmark

2021-12-03 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook resolved ARROW-14961.
--
Fix Version/s: 7.0.0
   Resolution: Fixed

Issue resolved by pull request 11833
[https://github.com/apache/arrow/pull/11833]

> [C++] Bump version on Google Benchmark
> --
>
> Key: ARROW-14961
> URL: https://issues.apache.org/jira/browse/ARROW-14961
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Benchmarking, C++
>Reporter: Sasha Krassovsky
>Assignee: Sasha Krassovsky
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Google Benchmark v1.6.0 came out - I'd like to use a couple of functions it 
> provides in a different issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-1699) [C++] Forward, backward fill kernel functions

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1699:
--
Labels: analytics dataframe pull-request-available  (was: analytics 
dataframe)

> [C++] Forward, backward fill kernel functions
> -
>
> Key: ARROW-1699
> URL: https://issues.apache.org/jira/browse/ARROW-1699
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Alvin Chunga Mamani
>Priority: Major
>  Labels: analytics, dataframe, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Like ffill / bfill in pandas (with limit)
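
The intended semantics, sketched in plain Python with None standing in for null: each null takes the nearest preceding (or following) valid value, and limit caps how many consecutive nulls are filled. forward_fill and backward_fill are hypothetical names, not the kernel names used in Arrow.

```python
def forward_fill(values, limit=None):
    """Replace None with the last seen non-null value.

    `limit` caps how many consecutive nulls are filled after each valid value.
    """
    out = []
    last = None  # most recent non-null value, if any
    run = 0      # nulls filled since `last` was seen
    for v in values:
        if v is not None:
            last, run = v, 0
            out.append(v)
        elif last is not None and (limit is None or run < limit):
            run += 1
            out.append(last)
        else:
            out.append(None)  # leading null, or limit exhausted
    return out


def backward_fill(values, limit=None):
    """Same as forward_fill, but propagating values from right to left."""
    return forward_fill(values[::-1], limit)[::-1]


filled = forward_fill([1, None, None, None, 2, None], limit=2)
```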



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-14979) [C++] GCS integration tests leak processes

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14979:
---
Labels: pull-request-available  (was: )

> [C++] GCS integration tests leak processes
> --
>
> Key: ARROW-14979
> URL: https://issues.apache.org/jira/browse/ARROW-14979
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Carlos O'Ryan
>Assignee: Carlos O'Ryan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The GCS integration tests fail to fully shut down all the processes created to 
> run the GCS testbench.





[jira] [Created] (ARROW-14980) [C++] GcsFileSystem tests should use PYTHON environment variable

2021-12-03 Thread Carlos O'Ryan (Jira)
Carlos O'Ryan created ARROW-14980:
-

 Summary: [C++] GcsFileSystem tests should use PYTHON environment 
variable
 Key: ARROW-14980
 URL: https://issues.apache.org/jira/browse/ARROW-14980
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Carlos O'Ryan








[jira] [Updated] (ARROW-14506) [C++] Add GCS library to conda files

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14506:
---
Labels: pull-request-available  (was: )

> [C++] Add GCS library to conda files
> 
>
> Key: ARROW-14506
> URL: https://issues.apache.org/jira/browse/ARROW-14506
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Carlos O'Ryan
>Assignee: Carlos O'Ryan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It seems that the conda package for `google-cloud-cpp` is not usable:
> https://github.com/conda-forge/google-cloud-cpp-feedstock/issues/68
> This is a reminder to add the dependency once the package is fixed.





[jira] [Assigned] (ARROW-14979) [C++] GCS integration tests leak processes

2021-12-03 Thread Carlos O'Ryan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos O'Ryan reassigned ARROW-14979:
-

Assignee: Carlos O'Ryan

> [C++] GCS integration tests leak processes
> --
>
> Key: ARROW-14979
> URL: https://issues.apache.org/jira/browse/ARROW-14979
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Carlos O'Ryan
>Assignee: Carlos O'Ryan
>Priority: Major
>
> The GCS integration tests fail to fully shut down all the processes created to 
> run the GCS testbench.





[jira] [Created] (ARROW-14979) [C++] GCS integration tests leak processes

2021-12-03 Thread Carlos O'Ryan (Jira)
Carlos O'Ryan created ARROW-14979:
-

 Summary: [C++] GCS integration tests leak processes
 Key: ARROW-14979
 URL: https://issues.apache.org/jira/browse/ARROW-14979
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Carlos O'Ryan


The GCS integration tests fail to fully shut down all the processes created to 
run the GCS testbench.






[jira] [Assigned] (ARROW-14506) [C++] Add GCS library to conda files

2021-12-03 Thread Carlos O'Ryan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos O'Ryan reassigned ARROW-14506:
-

Assignee: Carlos O'Ryan

> [C++] Add GCS library to conda files
> 
>
> Key: ARROW-14506
> URL: https://issues.apache.org/jira/browse/ARROW-14506
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Carlos O'Ryan
>Assignee: Carlos O'Ryan
>Priority: Major
>
> It seems that the conda package for `google-cloud-cpp` is not usable:
> https://github.com/conda-forge/google-cloud-cpp-feedstock/issues/68
> This is a reminder to add the dependency once the package is fixed.





[jira] [Updated] (ARROW-14941) [R] Implement Duration R6 class and bindings for lubridate::duration()

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14941:
---
Labels: pull-request-available  (was: )

> [R] Implement Duration R6 class and bindings for lubridate::duration()
> --
>
> Key: ARROW-14941
> URL: https://issues.apache.org/jira/browse/ARROW-14941
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access

2021-12-03 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453165#comment-17453165
 ] 

Rok Mihevc commented on ARROW-13168:


Ah, I see your point. Neat! Is the tzdb package always present?






[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access

2021-12-03 Thread Dewey Dunnington (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453161#comment-17453161
 ] 

Dewey Dunnington commented on ARROW-13168:
--

For sure you don't want to depend on it being present from C++! From our end on 
the R bindings, though, it means we can support Windows through the tzdb R 
package (if C++ lets us point at a directory with a database in the right 
format).






[jira] [Updated] (ARROW-14978) Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1

2021-12-03 Thread Eric Henry (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Henry updated ARROW-14978:
---
Description: 
*Description:*

Unable to import flight module in pyarrow in version 6.0.0 and 6.0.1.  Is there 
a different package I can install that provides the module? Or is this 
potentially an issue with me using apple silicon?

 

*Error:*

>>> import pyarrow.flight
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/repo/pyarrow/flight.py", line 18, in <module>
    from pyarrow._flight import (  # noqa:F401
ModuleNotFoundError: No module named 'pyarrow._flight'

 

*How to replicate*

pip3.8 install pyarrow==6.0.1

python3.8 -c 'import pyarrow.flight'

 

*Python version:*

Python version: 3.8

 

*Environment*

MacBook Pro (Apple Silicon)

  was:
*Description:*

Unable to import flight module in pyarrow since version 6.0.0.  Is there a 
different package I can install that provides the module? Or is this 
potentially an issue with me using apple silicon?

 

*Error:*

>>> import pyarrow.flight
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/repo/pyarrow/flight.py", line 18, in <module>
    from pyarrow._flight import (  # noqa:F401
ModuleNotFoundError: No module named 'pyarrow._flight'

 

*How to replicate*

pip3.8 install pyarrow==6.0.1

python3.8 -c 'import pyarrow.flight'

 

*Python version:*

Python version: 3.8

 

*Environment*

MacBook Pro (Apple Silicon)


> Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1
> --
>
> Key: ARROW-14978
> URL: https://issues.apache.org/jira/browse/ARROW-14978
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 6.0.0, 6.0.1
>Reporter: Eric Henry
>Priority: Major
>
> *Description:*
> Unable to import flight module in pyarrow in version 6.0.0 and 6.0.1.  Is 
> there a different package I can install that provides the module? Or is this 
> potentially an issue with me using apple silicon?
>  
> *Error:*
> >>> import pyarrow.flight
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/path/to/repo/pyarrow/flight.py", line 18, in <module>
>     from pyarrow._flight import (  # noqa:F401
> ModuleNotFoundError: No module named 'pyarrow._flight'
>  
> *How to replicate*
> pip3.8 install pyarrow==6.0.1
> python3.8 -c 'import pyarrow.flight'
>  
> *Python version:*
> Python version: 3.8
>  
> *Environment*
> MacBook Pro (Apple Silicon)
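
When reporting an import failure like this one, it can help to attach a few facts about the environment. The following is a small sketch using only the Python standard library; the helper names module_available and flight_report are made up for this example and are not part of pyarrow. It does not diagnose the root cause, it only records whether the compiled pyarrow._flight extension can be located and which machine architecture Python reports.

```python
import importlib.util
import platform
import sys


def module_available(name):
    """Return True if `name` can be located.

    find_spec imports parent packages (e.g. pyarrow) but not the named
    submodule itself; a missing or broken parent is reported as False.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        return False


def flight_report():
    """Collect details worth attaching to a bug report (hypothetical helper)."""
    return {
        "python": "%d.%d" % sys.version_info[:2],
        "machine": platform.machine(),  # typically "arm64" on Apple silicon
        "pyarrow": module_available("pyarrow"),
        "pyarrow._flight": module_available("pyarrow._flight"),
    }


if __name__ == "__main__":
    for key, value in flight_report().items():
        print(f"{key}: {value}")
```

If "pyarrow" is True but "pyarrow._flight" is False, the installed build lacks the Flight extension module, which matches the ModuleNotFoundError in the traceback.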





[jira] [Updated] (ARROW-14978) Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1

2021-12-03 Thread Eric Henry (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Henry updated ARROW-14978:
---
Description: 
*Description:*

Unable to import flight module in pyarrow since version 6.0.0.  Is there a 
different package I can install that provides the module? Or is this 
potentially an issue with me using apple silicon?

 

*Error:*

>>> import pyarrow.flight
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/repo/pyarrow/flight.py", line 18, in <module>
    from pyarrow._flight import (  # noqa:F401
ModuleNotFoundError: No module named 'pyarrow._flight'

 

*How to replicate*

pip3.8 install pyarrow==6.0.1

python3.8 -c 'import pyarrow.flight'

 

*Python version:*

Python version: 3.8

 

*Environment*

MacBook Pro (Apple Silicon)

  was:
*Description:*

Unable to import flight module in pyarrow since version 6.0.0. This has been 
preventing me from upgrading my version of pyarrow. Will this no longer be 
included in the wheel? If so, is there a Jira ticket or announcement so I can 
get more information? Is there a different package I can install that provides 
the module? Or is this potentially an issue with me using apple silicon?

 

*Error:*

>>> import pyarrow.flight
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/repo/pyarrow/flight.py", line 18, in <module>
    from pyarrow._flight import (  # noqa:F401
ModuleNotFoundError: No module named 'pyarrow._flight'

 

*How to replicate*

pip3.8 install pyarrow==6.0.1

python3.8 -c 'import pyarrow.flight'

 

*Python version:*

Python version: 3.8

 

*Environment*

MacBook Pro (Apple Silicon)


> Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1
> --
>
> Key: ARROW-14978
> URL: https://issues.apache.org/jira/browse/ARROW-14978
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 6.0.0, 6.0.1
>Reporter: Eric Henry
>Priority: Major
>
> *Description:*
> Unable to import flight module in pyarrow since version 6.0.0.  Is there a 
> different package I can install that provides the module? Or is this 
> potentially an issue with me using apple silicon?
>  
> *Error:*
> >>> import pyarrow.flight
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/path/to/repo/pyarrow/flight.py", line 18, in <module>
>     from pyarrow._flight import (  # noqa:F401
> ModuleNotFoundError: No module named 'pyarrow._flight'
>  
> *How to replicate*
> pip3.8 install pyarrow==6.0.1
> python3.8 -c 'import pyarrow.flight'
>  
> *Python version:*
> Python version: 3.8
>  
> *Environment*
> MacBook Pro (Apple Silicon)





[jira] [Resolved] (ARROW-14903) [C++] Enable CSV Writer to control string to be used for missing data

2021-12-03 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-14903.
--
Fix Version/s: 7.0.0
   Resolution: Fixed

Issue resolved by pull request 11836
[https://github.com/apache/arrow/pull/11836]

> [C++] Enable CSV Writer to control string to be used for missing data
> -
>
> Key: ARROW-14903
> URL: https://issues.apache.org/jira/browse/ARROW-14903
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Johan Peltenburg
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> This would allow the user to control how missing values are written to a CSV 
> file using the R {{write_csv_arrow()}} functionality.
> {{{}na{}}}: string used for missing values. Defaults to {{{}NA{}}}. Missing 
> values are never quoted; strings with the same value as {{na}} will always be 
> quoted.
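
The rule described above (missing values written as the na string and never quoted, real strings equal to na always quoted) can be sketched in a few lines of Python. This is an illustration only, not the Arrow CSV writer; the function names format_csv_field and format_csv_row are hypothetical, and the set of characters that force quoting is an assumption based on common CSV practice.

```python
def format_csv_field(value, na="NA"):
    """Render one value; None stands in for a missing (null) value."""
    if value is None:
        return na  # missing values are written bare, never quoted
    text = str(value)
    # Quote strings that would otherwise be mistaken for a missing value,
    # and strings containing a delimiter, quote or line break (assumption).
    needs_quotes = text == na or any(ch in text for ch in ',"\r\n')
    if needs_quotes:
        return '"' + text.replace('"', '""') + '"'
    return text


def format_csv_row(values, na="NA"):
    """Join the rendered fields of one row with commas."""
    return ",".join(format_csv_field(v, na) for v in values)


print(format_csv_row([1, None, "NA", "a,b"]))  # -> 1,NA,"NA","a,b"
```

Quoting the literal string "NA" is what lets a reader tell it apart from a missing value on the way back in.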





[jira] [Updated] (ARROW-14978) Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1

2021-12-03 Thread Eric Henry (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Henry updated ARROW-14978:
---
Description: 
*Description:*

Unable to import flight module in pyarrow since version 6.0.0. This has been 
preventing me from upgrading my version of pyarrow. Will this no longer be 
included in the wheel? If so, is there a Jira ticket or announcement so I can 
get more information? Is there a different package I can install that provides 
the module? Or is this potentially an issue with me using apple silicon?

 

*Error:*

>>> import pyarrow.flight
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/repo/pyarrow/flight.py", line 18, in <module>
    from pyarrow._flight import (  # noqa:F401
ModuleNotFoundError: No module named 'pyarrow._flight'

 

*How to replicate*

pip3.8 install pyarrow==6.0.1

python3.8 -c 'import pyarrow.flight'

 

*Python version:*

Python version: 3.8

 

*Environment*

MacBook Pro (Apple Silicon)

  was:
*Description:*

Unable to import flight module in pyarrow since version 6.0.0. This has been 
preventing me from upgrading my version of pyarrow. Will this no longer be 
included in the wheel? If so, is there a Jira ticket or announcement so I can 
get more information? Is there a different package I can install that provides 
the module? Or is this potentially an issue with me using apple silicon?

 

*Error:*

>>> import pyarrow.flight
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/erichenry/PycharmProjects/flying-pandas/venv/lib/python3.8/site-packages/pyarrow/flight.py", line 18, in <module>
    from pyarrow._flight import (  # noqa:F401
ModuleNotFoundError: No module named 'pyarrow._flight'

 

*How to replicate*

pip3.8 install pyarrow==6.0.1

python3.8 -c 'import pyarrow.flight'

 

*Python version:*

Python version: 3.8

 

*Environment*

MacBook Pro (Apple Silicon)


> Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1
> --
>
> Key: ARROW-14978
> URL: https://issues.apache.org/jira/browse/ARROW-14978
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 6.0.0, 6.0.1
>Reporter: Eric Henry
>Priority: Major
>
> *Description:*
> Unable to import flight module in pyarrow since version 6.0.0. This has been 
> preventing me from upgrading my version of pyarrow. Will this no longer be 
> included in the wheel? If so, is there a Jira ticket or announcement so I can 
> get more information? Is there a different package I can install that 
> provides the module? Or is this potentially an issue with me using apple 
> silicon?
>  
> *Error:*
> >>> import pyarrow.flight
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/path/to/repo/pyarrow/flight.py", line 18, in <module>
>     from pyarrow._flight import (  # noqa:F401
> ModuleNotFoundError: No module named 'pyarrow._flight'
>  
> *How to replicate*
> pip3.8 install pyarrow==6.0.1
> python3.8 -c 'import pyarrow.flight'
>  
> *Python version:*
> Python version: 3.8
>  
> *Environment*
> MacBook Pro (Apple Silicon)





[jira] [Updated] (ARROW-14978) Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1

2021-12-03 Thread Eric Henry (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Henry updated ARROW-14978:
---
Description: 
*Description:*

Unable to import flight module in pyarrow since version 6.0.0. This has been 
preventing me from upgrading my version of pyarrow. Will this no longer be 
included in the wheel? If so, is there a Jira ticket or announcement so I can 
get more information? Is there a different package I can install that provides 
the module? Or is this potentially an issue with me using apple silicon?

 

*Error:*

>>> import pyarrow.flight
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/erichenry/PycharmProjects/flying-pandas/venv/lib/python3.8/site-packages/pyarrow/flight.py", line 18, in <module>
    from pyarrow._flight import (  # noqa:F401
ModuleNotFoundError: No module named 'pyarrow._flight'

 

*How to replicate*

pip3.8 install pyarrow==6.0.1

python3.8 -c 'import pyarrow.flight'

 

*Python version:*

Python version: 3.8

 

*Environment*

MacBook Pro (Apple Silicon)

  was:
*Description:*

Unable to import flight module in pyarrow since version 6.0.0. This has been 
preventing me from upgrading my version of pyarrow. Will this no longer be 
included in the wheel? If so, is there a Jira ticket or announcement so I can 
get more information? Is there a different package I can install that provides 
the module?

 

*Error:*

>>> import pyarrow.flight
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/erichenry/PycharmProjects/flying-pandas/venv/lib/python3.8/site-packages/pyarrow/flight.py", line 18, in <module>
    from pyarrow._flight import (  # noqa:F401
ModuleNotFoundError: No module named 'pyarrow._flight'

 

*How to replicate*

pip3.8 install pyarrow==6.0.1

python3.8 -c 'import pyarrow.flight'

 

*Python version:*

Python version: 3.8

 


> Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1
> --
>
> Key: ARROW-14978
> URL: https://issues.apache.org/jira/browse/ARROW-14978
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 6.0.0, 6.0.1
>Reporter: Eric Henry
>Priority: Major
>
> *Description:*
> Unable to import flight module in pyarrow since version 6.0.0. This has been 
> preventing me from upgrading my version of pyarrow. Will this no longer be 
> included in the wheel? If so, is there a Jira ticket or announcement so I can 
> get more information? Is there a different package I can install that 
> provides the module? Or is this potentially an issue with me using apple 
> silicon?
>  
> *Error:*
> >>> import pyarrow.flight
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/erichenry/PycharmProjects/flying-pandas/venv/lib/python3.8/site-packages/pyarrow/flight.py", line 18, in <module>
>     from pyarrow._flight import (  # noqa:F401
> ModuleNotFoundError: No module named 'pyarrow._flight'
>  
> *How to replicate*
> pip3.8 install pyarrow==6.0.1
> python3.8 -c 'import pyarrow.flight'
>  
> *Python version:*
> Python version: 3.8
>  
> *Environment*
> MacBook Pro (Apple Silicon)





[jira] [Created] (ARROW-14978) Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1

2021-12-03 Thread Eric Henry (Jira)
Eric Henry created ARROW-14978:
--

 Summary: Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1
 Key: ARROW-14978
 URL: https://issues.apache.org/jira/browse/ARROW-14978
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 6.0.1, 6.0.0
Reporter: Eric Henry


*Description:*

Unable to import flight module in pyarrow since version 6.0.0. This has been 
preventing me from upgrading my version of pyarrow. Will this no longer be 
included in the wheel? If so, is there a Jira ticket or announcement so I can 
get more information? Is there a different package I can install that provides 
the module?

 

*Error:*

>>> import pyarrow.flight
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/erichenry/PycharmProjects/flying-pandas/venv/lib/python3.8/site-packages/pyarrow/flight.py", line 18, in <module>
    from pyarrow._flight import (  # noqa:F401
ModuleNotFoundError: No module named 'pyarrow._flight'

 

*How to replicate*

pip3.8 install pyarrow==6.0.1

python3.8 -c 'import pyarrow.flight'

 

*Python version:*

Python version: 3.8

 





[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access

2021-12-03 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453136#comment-17453136
 ] 

Rok Mihevc commented on ARROW-13168:


Indeed it appears to be [~willjones127]. It also seems to be doing the bundling 
approach - including the timezone data with the install - 
[https://github.com/r-lib/tzdb/tree/main/inst/tzdata]

We also use date.h and tz.h (https://github.com/HowardHinnant/date#summary) in 
Arrow for almost all time related things.

> [C++] Timezone database configuration and access
> 
>
> Key: ARROW-13168
> URL: https://issues.apache.org/jira/browse/ARROW-13168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: timestamp
>
>  Note: currently timezone database is not available on windows so timezone 
> aware operations will fail.
> We're using tz.h library which needs an updated timezone database to 
> correctly handle timezoned timestamps. See [installation 
> instructions|https://howardhinnant.github.io/date/tz.html#Installation].
> We have the following options for getting a timezone database:
>  # local (non-windows) OS timezone database - no work required.
>  # arrow bundled folder - we could bundle the database at build time for 
> windows. Database would slowly go stale.
>  # download it from IANA Time Zone Database at runtime - tz.h gets the 
> database at runtime, but curl (and 7-zip on windows) are required.
>  # local user-provided folder - user could provide a location at buildtime. 
> Nice to have.
>  # allow runtime configuration - at runtime say: "the tzdata can be found at 
> this location"
> For more context see: 
> [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP 
> 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data].





[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access

2021-12-03 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453130#comment-17453130
 ] 

Will Jones commented on ARROW-13168:


It looks like tzdb is a wrapper around this C++ library: 
[https://github.com/HowardHinnant/date]

I also saw that same library recommended in [this StackOverflow answer by a 
Microsoft employee about Windows TZ 
libraries|https://stackoverflow.com/a/47106207/2048858].


> [C++] Timezone database configuration and access
> 
>
> Key: ARROW-13168
> URL: https://issues.apache.org/jira/browse/ARROW-13168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: timestamp
>
>  Note: currently timezone database is not available on windows so timezone 
> aware operations will fail.
> We're using tz.h library which needs an updated timezone database to 
> correctly handle timezoned timestamps. See [installation 
> instructions|https://howardhinnant.github.io/date/tz.html#Installation].
> We have the following options for getting a timezone database:
>  # local (non-windows) OS timezone database - no work required.
>  # arrow bundled folder - we could bundle the database at build time for 
> windows. Database would slowly go stale.
>  # download it from IANA Time Zone Database at runtime - tz.h gets the 
> database at runtime, but curl (and 7-zip on windows) are required.
>  # local user-provided folder - user could provide a location at buildtime. 
> Nice to have.
>  # allow runtime configuration - at runtime say: "the tzdata can be found at 
> this location"
> For more context see: 
> [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP 
> 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data].





[jira] [Updated] (ARROW-14905) [C++] Enable CSV Writer to handle quoting

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14905:
---
Labels: pull-request-available  (was: )

> [C++] Enable CSV Writer to handle quoting
> -
>
> Key: ARROW-14905
> URL: https://issues.apache.org/jira/browse/ARROW-14905
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Johan Peltenburg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This will allow more control over quoting. In {{readr::write_csv()}} 
> {{{}quote{}}} instructs on how to handle fields which contain characters that 
> need to be quoted: 
>  * {{{}needed{}}}: only quote fields which need them
>  * {{{}all{}}}: quote all fields - I think this might be the implicit default 
> behaviour for {{write_csv_arrow()}}
>  * {{{}none{}}}: never quote fields
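
The three quoting modes listed above can be sketched as a single function. This is a Python illustration of the described behaviour, not the Arrow C++ writer or readr; the name quote_csv_field is hypothetical, and treating comma, double quote and line breaks as the characters that "need" quoting is an assumption.

```python
def quote_csv_field(text, mode="needed"):
    """Apply one of the quoting modes: "needed", "all" or "none"."""
    if mode not in ("needed", "all", "none"):
        raise ValueError(f"unknown quote mode: {mode!r}")
    # Characters assumed to require quoting in this sketch.
    needs_quotes = any(ch in text for ch in ',"\r\n')
    if mode == "all" or (mode == "needed" and needs_quotes):
        # Embedded quotes are escaped by doubling them.
        return '"' + text.replace('"', '""') + '"'
    # "none", or "needed" with nothing to protect: write the field as-is.
    return text


print(quote_csv_field("a,b"))          # -> "a,b"
print(quote_csv_field("ab", "all"))    # -> "ab"
print(quote_csv_field("a,b", "none"))  # -> a,b
```

Note that "none" can produce output that no longer parses back into the same columns, which is why "needed" is usually the safer default.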





[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access

2021-12-03 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453113#comment-17453113
 ] 

Rok Mihevc commented on ARROW-13168:


Nice to see interest in this [~paleolimbot]! I will probably pick it up next.

I'm not sure we want to depend on R/Python libraries being present from C++, 
but I'll look into this option, thanks!

> [C++] Timezone database configuration and access
> 
>
> Key: ARROW-13168
> URL: https://issues.apache.org/jira/browse/ARROW-13168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: timestamp
>
>  Note: currently timezone database is not available on windows so timezone 
> aware operations will fail.
> We're using tz.h library which needs an updated timezone database to 
> correctly handle timezoned timestamps. See [installation 
> instructions|https://howardhinnant.github.io/date/tz.html#Installation].
> We have the following options for getting a timezone database:
>  # local (non-windows) OS timezone database - no work required.
>  # arrow bundled folder - we could bundle the database at build time for 
> windows. Database would slowly go stale.
>  # download it from IANA Time Zone Database at runtime - tz.h gets the 
> database at runtime, but curl (and 7-zip on windows) are required.
>  # local user-provided folder - user could provide a location at buildtime. 
> Nice to have.
>  # allow runtime configuration - at runtime say: "the tzdata can be found at 
> this location"
> For more context see: 
> [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP 
> 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data].





[jira] [Commented] (ARROW-13035) [C++] Create a compute function returning indices of non-zero values

2021-12-03 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453109#comment-17453109
 ] 

Alessandro Molina commented on ARROW-13035:
---

I just took it the other day to remind myself to tackle it, but I haven't done 
any work on it yet as I'm finishing another issue. I hope to be able to include 
it in 7.0.0, by the way.

> [C++] Create a compute function returning indices of non-zero values
> 
>
> Key: ARROW-13035
> URL: https://issues.apache.org/jira/browse/ARROW-13035
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: good-first-issue
>
> This would be similar to Numpy's {{nonzero}} function:
> [https://numpy.org/doc/stable/reference/generated/numpy.nonzero.html]
> {code:python}
> >>> arr = np.array([4,5,0,6,0,5])
> >>> np.nonzero(arr)
> (array([0, 1, 3, 5]),)
> {code}
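
For reference, the requested semantics can be written in plain Python without NumPy. This is only a sketch of the behaviour shown in the example above; the name nonzero_indices is hypothetical, and skipping None (standing in for Arrow nulls) is an assumption, since the issue does not say how nulls should be handled.

```python
def nonzero_indices(values):
    """Return the positions of elements that are neither zero nor missing.

    None stands in for a null value here and is skipped (an assumption).
    """
    return [i for i, v in enumerate(values) if v is not None and v != 0]


print(nonzero_indices([4, 5, 0, 6, 0, 5]))  # -> [0, 1, 3, 5]
```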





[jira] [Resolved] (ARROW-13016) [C++] Support Null type in Sum/Mean aggregation

2021-12-03 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-13016.
--
Fix Version/s: 7.0.0
   Resolution: Fixed

Issue resolved by pull request 10486
[https://github.com/apache/arrow/pull/10486]

> [C++] Support Null type in Sum/Mean aggregation
> ---
>
> Key: ARROW-13016
> URL: https://issues.apache.org/jira/browse/ARROW-13016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Chenxi Li
>Assignee: Chenxi Li
>Priority: Minor
>  Labels: kernel, pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-13016) [C++] Support Null type in Sum/Mean aggregation

2021-12-03 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13016:
-
Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Support Null type in Sum/Mean aggregation
> ---
>
> Key: ARROW-13016
> URL: https://issues.apache.org/jira/browse/ARROW-13016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Chenxi Li
>Assignee: Chenxi Li
>Priority: Minor
>  Labels: kernel, pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-13811) [Java] Provide a general out-of-place sorter

2021-12-03 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan resolved ARROW-13811.
--
Fix Version/s: 7.0.0
   Resolution: Fixed

Issue resolved by pull request 11035
[https://github.com/apache/arrow/pull/11035]

> [Java] Provide a general out-of-place sorter
> 
>
> Key: ARROW-13811
> URL: https://issues.apache.org/jira/browse/ARROW-13811
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The sorter should work for any type of vector, with a time complexity of 
> O(n*log(n)). 
> Since it does not make any assumptions about the memory layout of the vector, 
> its performance can be sub-optimal. So if another sorter is applicable 
> (e.g. {{FixedWidthInPlaceVectorSorter}}), it should be used in preference. 
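
The idea of a layout-agnostic, out-of-place sort can be sketched as follows. This is a Python illustration, not the Java implementation from the pull request; the function name sort_out_of_place and its get/compare parameters are hypothetical. It touches the source only through an element accessor and a comparator, sorts a list of indices, and writes the result to a new container.

```python
from functools import cmp_to_key


def sort_out_of_place(get, length, compare):
    """Return a new list holding get(0) .. get(length - 1) in sorted order.

    Only `get(i)` and `compare(a, b)` are used, so nothing is assumed about
    how the source stores its elements, and the source is left untouched.
    """
    # Sort the indices rather than the data; Python's sorted() performs
    # O(n log n) comparisons.
    order = sorted(
        range(length),
        key=cmp_to_key(lambda i, j: compare(get(i), get(j))),
    )
    return [get(i) for i in order]


source = ["pear", "fig", "apple"]
result = sort_out_of_place(source.__getitem__, len(source),
                           lambda a, b: (a > b) - (a < b))
print(result)  # -> ['apple', 'fig', 'pear']
```

Going through an accessor on every comparison is what makes this general and also what makes it slower than a sorter that knows the memory layout.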





[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access

2021-12-03 Thread Dewey Dunnington (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453018#comment-17453018
 ] 

Dewey Dunnington commented on ARROW-13168:
--

Just a +1 for a "runtime configuration" option. In R we have the [tzdb 
package|]. Currently it only provides the text format of the IANA database but 
we could use that approach if we need something different (maintained 
separately to keep it up to date). I'm less familiar with Python but I imagine 
something similar exists there, too.

{code:R}
list.files(tzdb::tzdb_path("text"))
#>  [1] "africa"            "antarctica"        "asia"
#>  [4] "australasia"       "backward"          "backzone"
#>  [7] "calendars"         "checklinks.awk"    "checktab.awk"
#> [10] "CONTRIBUTING"      "etcetera"          "europe"
#> [13] "factory"           "iso3166.tab"       "leap-seconds.list"
#> [16] "leapseconds"       "leapseconds.awk"   "LICENSE"
#> [19] "Makefile"          "NEWS"              "northamerica"
#> [22] "README"            "southamerica"      "theory.html"
#> [25] "version"           "windowsZones.xml"  "ziguard.awk"
#> [28] "zishrink.awk"      "zone.tab"          "zone1970.tab"
#> [31] "zoneinfo2tdf.pl"
{code}


> [C++] Timezone database configuration and access
> 
>
> Key: ARROW-13168
> URL: https://issues.apache.org/jira/browse/ARROW-13168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: timestamp
>
>  Note: currently timezone database is not available on windows so timezone 
> aware operations will fail.
> We're using tz.h library which needs an updated timezone database to 
> correctly handle timezoned timestamps. See [installation 
> instructions|https://howardhinnant.github.io/date/tz.html#Installation].
> We have the following options for getting a timezone database:
>  # local (non-windows) OS timezone database - no work required.
>  # arrow bundled folder - we could bundle the database at build time for 
> windows. Database would slowly go stale.
>  # download it from IANA Time Zone Database at runtime - tz.h gets the 
> database at runtime, but curl (and 7-zip on windows) are required.
>  # local user-provided folder - user could provide a location at buildtime. 
> Nice to have.
>  # allow runtime configuration - at runtime say: "the tzdata can be found at 
> this location"
> For more context see: 
> [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP 
> 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data].





[jira] [Comment Edited] (ARROW-13168) [C++] Timezone database configuration and access

2021-12-03 Thread Dewey Dunnington (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453018#comment-17453018
 ] 

Dewey Dunnington edited comment on ARROW-13168 at 12/3/21, 1:53 PM:


Just a +1 for a "runtime configuration" option. In R we have the [tzdb 
package|https://github.com/r-lib/tzdb]. Currently it only provides the text 
format of the IANA database but we could use that approach if we need something 
different (maintained separately to keep it up to date). I'm less familiar 
with Python but I imagine something similar exists there, too.

{code:R}
list.files(tzdb::tzdb_path("text"))
#>  [1] "africa"            "antarctica"        "asia"
#>  [4] "australasia"       "backward"          "backzone"
#>  [7] "calendars"         "checklinks.awk"    "checktab.awk"
#> [10] "CONTRIBUTING"      "etcetera"          "europe"
#> [13] "factory"           "iso3166.tab"       "leap-seconds.list"
#> [16] "leapseconds"       "leapseconds.awk"   "LICENSE"
#> [19] "Makefile"          "NEWS"              "northamerica"
#> [22] "README"            "southamerica"      "theory.html"
#> [25] "version"           "windowsZones.xml"  "ziguard.awk"
#> [28] "zishrink.awk"      "zone.tab"          "zone1970.tab"
#> [31] "zoneinfo2tdf.pl"
{code}



was (Author: paleolimbot):
Just a +1 for a "runtime configuration" option. In R we have the [tzdb 
package|]. Currently it only provides the text format of the IANA database but 
we could use that approach if we need something different (maintained 
separately to keep it up to date). I'm less familiar with Python but I imagine 
something similar exists there, too.

{code:R}
list.files(tzdb::tzdb_path("text"))
#>  [1] "africa"            "antarctica"        "asia"
#>  [4] "australasia"       "backward"          "backzone"
#>  [7] "calendars"         "checklinks.awk"    "checktab.awk"
#> [10] "CONTRIBUTING"      "etcetera"          "europe"
#> [13] "factory"           "iso3166.tab"       "leap-seconds.list"
#> [16] "leapseconds"       "leapseconds.awk"   "LICENSE"
#> [19] "Makefile"          "NEWS"              "northamerica"
#> [22] "README"            "southamerica"      "theory.html"
#> [25] "version"           "windowsZones.xml"  "ziguard.awk"
#> [28] "zishrink.awk"      "zone.tab"          "zone1970.tab"
#> [31] "zoneinfo2tdf.pl"
{code}


> [C++] Timezone database configuration and access
> 
>
> Key: ARROW-13168
> URL: https://issues.apache.org/jira/browse/ARROW-13168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: timestamp
>
>  Note: currently timezone database is not available on windows so timezone 
> aware operations will fail.
> We're using tz.h library which needs an updated timezone database to 
> correctly handle timezoned timestamps. See [installation 
> instructions|https://howardhinnant.github.io/date/tz.html#Installation].
> We have the following options for getting a timezone database:
>  # local (non-windows) OS timezone database - no work required.
>  # arrow bundled folder - we could bundle the database at build time for 
> windows. Database would slowly go stale.
>  # download it from IANA Time Zone Database at runtime - tz.h gets the 
> database at runtime, but curl (and 7-zip on windows) are required.
>  # local user-provided folder - user could provide a location at buildtime. 
> Nice to have.
>  # allow runtime configuration - at runtime say: "the tzdata can be found at 
> this location"
> For more context see: 
> [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP 
> 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data].
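
The "runtime configuration" option in the list above could look roughly like the following Python sketch. It only illustrates a lookup order: the function name `resolve_tzdata_dir`, the environment variable `ARROW_TZDATA_DIR`, and the default system directory are assumptions made up for this example, not part of Arrow or tz.h.

```python
import os
from typing import Iterable, Mapping, Optional


def resolve_tzdata_dir(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    system_dirs: Iterable[str] = ("/usr/share/zoneinfo",),
) -> Optional[str]:
    """Return the first existing tzdata directory, or None if none is found.

    Search order (an assumption of this sketch):
      1. a path passed explicitly at runtime,
      2. the hypothetical ARROW_TZDATA_DIR environment variable,
      3. well-known OS locations.
    """
    env = os.environ if env is None else env
    candidates = [explicit, env.get("ARROW_TZDATA_DIR"), *system_dirs]
    for path in candidates:
        # Skip unset candidates and paths that do not exist on this machine.
        if path and os.path.isdir(path):
            return path
    return None
```

Returning `None` instead of raising leaves it to the caller to decide whether a missing database is fatal, which matters on platforms where no system database exists.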





[jira] [Resolved] (ARROW-14843) [R] Implement decimal128() (to replace decimal())

2021-12-03 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-14843.

Fix Version/s: 7.0.0
   Resolution: Fixed

Issue resolved by pull request 11790
[https://github.com/apache/arrow/pull/11790]

> [R] Implement decimal128() (to replace decimal())
> -
>
> Key: ARROW-14843
> URL: https://issues.apache.org/jira/browse/ARROW-14843
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>






[jira] [Comment Edited] (ARROW-14905) [C++] Enable CSV Writer to handle quoting

2021-12-03 Thread Johan Peltenburg (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453003#comment-17453003
 ] 

Johan Peltenburg edited comment on ARROW-14905 at 12/3/21, 1:29 PM:


The default behavior is currently {{needed}}, so I'll stick to that as the 
default.

In the case of {{all}}, a decision is needed on whether quotes should be 
inserted for nulls.

What I'm currently brewing up doesn't insert them.
In that case, though, I'd opt for calling this option {{all_valid}}, making it 
clear to the user that only valid (non-null) values are quoted.

The second possibility is related to ARROW-14903, where we can set a custom 
value for nulls.
If we choose to insert quotes everywhere, a null will produce {{""}} even if 
the null value is empty.
A custom null value, as described in the other issue, would also be enclosed 
in quotes.
The drawback is that a null then becomes indistinguishable from a string that 
happens to contain the null value.
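
The trade-off between the two possibilities can be sketched in plain Python. This is not the Arrow CSV writer; `format_field` and `format_row` are hypothetical helpers, and only the style names ({{needed}}, {{all}}, {{all_valid}}, {{none}}) come from the discussion.

```python
from typing import Optional, Sequence

# Characters that force quoting in the "needed" style.
SPECIAL = set(',"\r\n')


def format_field(value: Optional[str], style: str, null_string: str = "") -> str:
    """Render one CSV field under the given quoting style."""
    if style not in ("needed", "all", "all_valid", "none"):
        raise ValueError(f"unknown quoting style: {style!r}")
    if value is None:
        # Only "all" quotes the null representation; every other style
        # writes it bare.
        return f'"{null_string}"' if style == "all" else null_string
    if style == "none":
        return value
    escaped = value.replace('"', '""')
    if style in ("all", "all_valid"):
        return f'"{escaped}"'
    # style == "needed": quote only when the value contains a special character.
    return f'"{escaped}"' if any(c in SPECIAL for c in value) else value


def format_row(values: Sequence[Optional[str]], style: str, null_string: str = "") -> str:
    return ",".join(format_field(v, style, null_string) for v in values)


row = ["a,b", None, "NA"]
print(format_row(row, "all_valid", null_string="NA"))  # "a,b",NA,"NA"
print(format_row(row, "all", null_string="NA"))        # "a,b","NA","NA"
```

With {{all_valid}} the null (bare `NA`) stays distinguishable from the string `"NA"`; with {{all}} the two render identically.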


was (Author: johanpel):
The default behavior is currently {{needed}}, so I'll stick to that as the 
default.

In the case of {{all}}, a decision is needed on whether quotes should be 
inserted for nulls.

What I'm currently brewing up doesn't insert them.
In that case, though, I'd opt for calling this option {{all_valid}}, making it 
clear to the user that only valid (non-null) values are quoted.

If we choose to insert them, a null will produce {{""}} when the quote style 
is set to {{all}}, even if the null value is empty.
This is slightly related to ARROW-14903: a custom null value would also be 
enclosed in quotes.
The drawback is that a null is then indistinguishable from a string that 
happens to contain the null value.

> [C++] Enable CSV Writer to handle quoting
> -
>
> Key: ARROW-14905
> URL: https://issues.apache.org/jira/browse/ARROW-14905
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Johan Peltenburg
>Priority: Major
>
> This will allow more control over quoting. In {{readr::write_csv()}} 
> {{quote}} instructs on how to handle fields which contain characters that 
> need to be quoted: 
>  * {{needed}}: only quote fields which need them
>  * {{all}}: quote all fields - I think this might be the implicit default 
> behaviour for {{write_csv_arrow()}}
>  * {{none}}: never quote fields





[jira] [Commented] (ARROW-14905) [C++] Enable CSV Writer to handle quoting

2021-12-03 Thread Johan Peltenburg (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453003#comment-17453003
 ] 

Johan Peltenburg commented on ARROW-14905:
--

The default behavior is currently {{needed}}, so I'll stick to that as the 
default.

In the case of {{all}}, a decision is needed on whether quotes should be 
inserted for nulls.

What I'm currently brewing up doesn't insert them.
In that case, though, I'd opt for calling this option {{all_valid}}, making it 
clear to the user that only valid (non-null) values are quoted.

If we choose to insert them, a null will produce {{""}} when the quote style 
is set to {{all}}, even if the null value is empty.
This is slightly related to ARROW-14903: a custom null value would also be 
enclosed in quotes.
The drawback is that a null is then indistinguishable from a string that 
happens to contain the null value.

> [C++] Enable CSV Writer to handle quoting
> -
>
> Key: ARROW-14905
> URL: https://issues.apache.org/jira/browse/ARROW-14905
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Johan Peltenburg
>Priority: Major
>
> This will allow more control over quoting. In {{readr::write_csv()}} 
> {{quote}} instructs on how to handle fields which contain characters that 
> need to be quoted: 
>  * {{needed}}: only quote fields which need them
>  * {{all}}: quote all fields - I think this might be the implicit default 
> behaviour for {{write_csv_arrow()}}
>  * {{none}}: never quote fields





[jira] [Comment Edited] (ARROW-14977) [Python] Add a "made-up" feature for the guide tutorial

2021-12-03 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452993#comment-17452993
 ] 

Alenka Frim edited comment on ARROW-14977 at 12/3/21, 1:13 PM:
---

What do you think about this feature [~amol-]?


was (Author: alenkaf):
What do you think about this issue [~amol-]?

> [Python] Add a "made-up" feature for the guide tutorial
> ---
>
> Key: ARROW-14977
> URL: https://issues.apache.org/jira/browse/ARROW-14977
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Minor
> Fix For: 7.0.0
>
>
> To create a guide tutorial for a simple Python feature I will add a made-up 
> function to `compute.py` in order to demonstrate the process of making a 
> first contribution.
> The function will call `min_max` and widen the interval by 1 in both 
> directions. I would call the new function `tutorial_example`.





[jira] [Commented] (ARROW-14977) [Python] Add a "made-up" feature for the guide tutorial

2021-12-03 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452993#comment-17452993
 ] 

Alenka Frim commented on ARROW-14977:
-

What do you think about this issue [~amol-]?

> [Python] Add a "made-up" feature for the guide tutorial
> ---
>
> Key: ARROW-14977
> URL: https://issues.apache.org/jira/browse/ARROW-14977
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Minor
> Fix For: 7.0.0
>
>
> To create a guide tutorial for a simple Python feature I will add a made-up 
> function to `compute.py` in order to demonstrate the process of making a 
> first contribution.
> The function will call `min_max` and widen the interval by 1 in both 
> directions. I would call the new function `tutorial_example`.





[jira] [Created] (ARROW-14977) [Python] Add a "made-up" feature for the guide tutorial

2021-12-03 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-14977:
---

 Summary: [Python] Add a "made-up" feature for the guide tutorial
 Key: ARROW-14977
 URL: https://issues.apache.org/jira/browse/ARROW-14977
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 7.0.0


To create a guide tutorial for a simple Python feature I will add a made-up 
function to `compute.py` in order to demonstrate the process of making a first 
contribution.

The function will call `min_max` and widen the interval by 1 in both 
directions. I would call the new function `tutorial_example`.
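
A minimal sketch of the intended behaviour, written as plain Python so it runs without pyarrow. The real feature would call `min_max` from `compute.py` instead of the built-in `min` and `max`. Skipping nulls and raising on empty input are assumptions of this sketch, not decisions taken in the issue.

```python
from typing import Iterable, Optional, Tuple


def tutorial_example(values: Iterable[Optional[float]]) -> Tuple[float, float]:
    """Return (min - 1, max + 1) over the non-null values.

    Stand-in for the proposed function: it computes the min/max interval
    and widens it by 1 in both directions.
    """
    valid = [v for v in values if v is not None]  # assumption: nulls are skipped
    if not valid:
        raise ValueError("need at least one non-null value")
    return min(valid) - 1, max(valid) + 1


print(tutorial_example([3, 1, None, 7]))  # (0, 8)
```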





[jira] [Updated] (ARROW-14445) [C++] Implement memory management for DataHolder

2021-12-03 Thread Alexander Ocsa (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ocsa updated ARROW-14445:
---
Description: Issue https://issues.apache.org/jira/browse/ARROW-14330 will 
require the use of a memory resource manager. This manager will be used to 
decide when to spill to disk. (was: Issue 
https://issues.apache.org/jira/browse/ARROW-14330 will require the use of 
memory stats. One method that could be added is the one to compute the use of 
memory for ExecBatches. This method + ExecContext:MemoryPool stats will be 
used by the DataHolder to decide when to spill to disk.)

> [C++] Implement memory management for DataHolder
> 
>
> Key: ARROW-14445
> URL: https://issues.apache.org/jira/browse/ARROW-14445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Alexander Ocsa
>Assignee: Alexander Ocsa
>Priority: Major
>  Labels: query-engine
> Fix For: 7.0.0
>
>
> Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use 
> of a memory resource manager. This manager will be used to decide when to 
> spill to disk.





[jira] [Updated] (ARROW-14445) [C++] Implement memory stats for DataHolder

2021-12-03 Thread Alexander Ocsa (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ocsa updated ARROW-14445:
---
Description: Issue https://issues.apache.org/jira/browse/ARROW-14330 will 
require the use of memory stats. One method that could be added is the one to 
compute the use of memory for ExecBatches. This method + 
ExecContext:MemoryPool stats will be used by the DataHolder to decide when to 
spill to disk. (was: Issue https://issues.apache.org/jira/browse/ARROW-14330 
will require the use of memory management. One method that could be added is 
the one to compute the use of memory for ExecBatches. This method + 
ExecContext:MemoryPool stats will be used by the DataHolder to decide when to 
spill to disk.)

> [C++] Implement memory stats for DataHolder
> ---
>
> Key: ARROW-14445
> URL: https://issues.apache.org/jira/browse/ARROW-14445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Alexander Ocsa
>Assignee: Alexander Ocsa
>Priority: Major
>  Labels: query-engine
> Fix For: 7.0.0
>
>
> Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use 
> of memory stats. One method that could be added is the one to compute the 
> use of memory for ExecBatches. This method + ExecContext:MemoryPool stats 
> will be used by the DataHolder to decide when to spill to disk.
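
The idea described above (batch size plus pool statistics drive the spill decision) might be sketched as follows. All names here (`PoolStats`, `batch_nbytes`, `should_spill`) and the simple threshold rule are hypothetical; the issue does not specify the policy or the C++ API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PoolStats:
    """Hypothetical snapshot of a memory pool."""
    bytes_allocated: int  # bytes currently held by the pool
    limit: int            # budget above which data should be spilled


def batch_nbytes(columns: Sequence[Sequence[bytes]]) -> int:
    """Sum the sizes of all buffers of all columns in a batch."""
    return sum(len(buffer) for column in columns for buffer in column)


def should_spill(stats: PoolStats, incoming_bytes: int) -> bool:
    """Spill when holding the incoming batch would exceed the budget."""
    return stats.bytes_allocated + incoming_bytes > stats.limit


# A batch with two columns: one with two buffers, one with a single buffer.
batch = [[b"\x00" * 8, b"\x01" * 32], [b"\x02" * 16]]
size = batch_nbytes(batch)                                         # 56 bytes
print(should_spill(PoolStats(bytes_allocated=100, limit=128), size))  # True
print(should_spill(PoolStats(bytes_allocated=50, limit=128), size))   # False
```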





[jira] [Updated] (ARROW-14445) [C++] Implement memory management for DataHolder

2021-12-03 Thread Alexander Ocsa (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ocsa updated ARROW-14445:
---
Summary: [C++] Implement memory management for DataHolder  (was: [C++] 
Implement memory stats for DataHolder)

> [C++] Implement memory management for DataHolder
> 
>
> Key: ARROW-14445
> URL: https://issues.apache.org/jira/browse/ARROW-14445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Alexander Ocsa
>Assignee: Alexander Ocsa
>Priority: Major
>  Labels: query-engine
> Fix For: 7.0.0
>
>
> Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use 
> of memory stats. One method that could be added is the one to compute the 
> use of memory for ExecBatches. This method + ExecContext:MemoryPool stats 
> will be used by the DataHolder to decide when to spill to disk.





[jira] [Updated] (ARROW-14445) [C++] Implement memory stats for DataHolder

2021-12-03 Thread Alexander Ocsa (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ocsa updated ARROW-14445:
---
Description: Issue https://issues.apache.org/jira/browse/ARROW-14330 will 
require the use of memory management. One method that could be added is the 
one to compute the use of memory for ExecBatches. This method + 
ExecContext:MemoryPool stats will be used by the DataHolder to decide when to 
spill to disk. (was: Issue https://issues.apache.org/jira/browse/ARROW-14330 
will require the use of memory stats. One method that could be added is the 
one to compute the use of memory for ExecBatches. This method + 
ExecContext:MemoryPool stats will be used by the DataHolder to decide when to 
spill to disk.)

> [C++] Implement memory stats for DataHolder
> ---
>
> Key: ARROW-14445
> URL: https://issues.apache.org/jira/browse/ARROW-14445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Alexander Ocsa
>Assignee: Alexander Ocsa
>Priority: Major
>  Labels: query-engine
> Fix For: 7.0.0
>
>
> Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use 
> of memory management. One method that could be added is the one to compute 
> the use of memory for ExecBatches. This method + ExecContext:MemoryPool 
> stats will be used by the DataHolder to decide when to spill to disk.





[jira] [Assigned] (ARROW-14905) [C++] Enable CSV Writer to handle quoting

2021-12-03 Thread Johan Peltenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Peltenburg reassigned ARROW-14905:


Assignee: Johan Peltenburg

> [C++] Enable CSV Writer to handle quoting
> -
>
> Key: ARROW-14905
> URL: https://issues.apache.org/jira/browse/ARROW-14905
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Johan Peltenburg
>Priority: Major
>
> This will allow more control over quoting. In {{readr::write_csv()}} 
> {{quote}} instructs on how to handle fields which contain characters that 
> need to be quoted: 
>  * {{needed}}: only quote fields which need them
>  * {{all}}: quote all fields - I think this might be the implicit default 
> behaviour for {{write_csv_arrow()}}
>  * {{none}}: never quote fields





[jira] [Assigned] (ARROW-1569) [C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types

2021-12-03 Thread Matthijs Brobbel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthijs Brobbel reassigned ARROW-1569:
---

Assignee: Matthijs Brobbel

> [C++] Kernel functions for determining monotonicity (ascending or descending) 
> for well-ordered types
> 
>
> Key: ARROW-1569
> URL: https://issues.apache.org/jira/browse/ARROW-1569
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Matthijs Brobbel
>Priority: Major
>  Labels: Analytics
>
> These kernels must offer some stateful variant so that monotonicity can be 
> determined across chunked arrays
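
The stateful variant mentioned above could work along these lines. This is a plain-Python sketch over lists standing in for chunks; the class name `MonotonicState` and the choice to skip nulls are assumptions, not the kernel design.

```python
from typing import Iterable, Optional


class MonotonicState:
    """Carries monotonicity information from one chunk to the next."""

    def __init__(self) -> None:
        self.last: Optional[float] = None  # last non-null value seen so far
        self.increasing = True             # no decrease observed yet
        self.decreasing = True             # no increase observed yet

    def consume(self, chunk: Iterable[Optional[float]]) -> "MonotonicState":
        for value in chunk:
            if value is None:
                continue  # assumption: nulls do not break monotonicity
            if self.last is not None:
                if value < self.last:
                    self.increasing = False
                elif value > self.last:
                    self.decreasing = False
            self.last = value
        return self


state = MonotonicState()
for chunk in [[1, 2, 2], [3, None, 5]]:
    state.consume(chunk)
print(state.increasing, state.decreasing)  # True False
```

Keeping `last` in the state is what lets the comparison span a chunk boundary: the first value of a chunk is checked against the final value of the previous one.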





[jira] [Resolved] (ARROW-14413) [C++][Gandiva] Implement levenshtein function

2021-12-03 Thread Pindikura Ravindra (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-14413.

Fix Version/s: 7.0.0
   Resolution: Fixed

Issue resolved by pull request 11522
[https://github.com/apache/arrow/pull/11522]

> [C++][Gandiva] Implement levenshtein function
> -
>
> Key: ARROW-14413
> URL: https://issues.apache.org/jira/browse/ARROW-14413
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Vinicius Souza Roque
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Returns the Levenshtein distance between two strings. For example, 
> levenshtein('kitten', 'sitting') results in 3.
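
For reference, the distance in the example can be reproduced with the textbook dynamic-programming algorithm. The Python sketch below is an illustration of that algorithm only; it is not the Gandiva implementation from the pull request.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop (less memory)
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```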





[jira] [Resolved] (ARROW-14048) [C++][Gandiva] Cache only object code in memory instead of entire module

2021-12-03 Thread Pindikura Ravindra (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-14048.

Fix Version/s: 7.0.0
   Resolution: Fixed

Issue resolved by pull request 11193
[https://github.com/apache/arrow/pull/11193]

> [C++][Gandiva] Cache only object code in memory instead of entire module
> 
>
> Key: ARROW-14048
> URL: https://issues.apache.org/jira/browse/ARROW-14048
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Augusto Alves Silva
>Assignee: Projjal Chanda
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 12h 50m
>  Remaining Estimate: 0h
>
> Make Gandiva cache only the object code instead of the entire LLVM module, 
> improving memory consumption and LLVM time performance.





[jira] [Resolved] (ARROW-12858) [C++][Gandiva] Add isNull, isTrue, isFalse, isNotTrue, IsNotFalse and NVL functions on Gandiva

2021-12-03 Thread Pindikura Ravindra (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-12858.

Fix Version/s: 7.0.0
   Resolution: Fixed

Issue resolved by pull request 10385
[https://github.com/apache/arrow/pull/10385]

> [C++][Gandiva] Add isNull, isTrue, isFalse, isNotTrue, IsNotFalse and NVL 
> functions on Gandiva
> --
>
> Key: ARROW-12858
> URL: https://issues.apache.org/jira/browse/ARROW-12858
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Rodrigo Jacomozzi de Bem
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-14936) [C++][Gandiva] Fix split_part function in gandiva

2021-12-03 Thread Pindikura Ravindra (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-14936.

Fix Version/s: 7.0.0
   Resolution: Fixed

Issue resolved by pull request 11819
[https://github.com/apache/arrow/pull/11819]

> [C++][Gandiva] Fix split_part function in gandiva
> -
>
> Key: ARROW-14936
> URL: https://issues.apache.org/jira/browse/ARROW-14936
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Projjal Chanda
>Assignee: Projjal Chanda
>Priority: Major
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> split_part function sporadically returns error with message "Couldn't 
> allocate output string"





[jira] [Created] (ARROW-14976) [Dev][Archery] Print more friendly error logs for non-existent benchmark

2021-12-03 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-14976:


 Summary: [Dev][Archery] Print more friendly error logs for 
non-existent benchmark
 Key: ARROW-14976
 URL: https://issues.apache.org/jira/browse/ARROW-14976
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Archery
Reporter: Yibo Cai


For a non-existent benchmark, the archery benchmark tool outputs error messages 
that are not very useful.
E.g., a typo (*parser* -> *parsing*) in the command _"archery benchmark diff 
--suite-filter=arrow-csv-parsing-benchmark"_ leads to the confusing error 
message below:
{noformat}
Traceback (most recent call last):
  File "pandas/_libs/index.pyx", line 108, in 
pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1533, in 
pandas._libs.hashtable.Float64HashTable.get_item
TypeError: must be real number, not str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/home/cyb/venv/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 
3361, in get_loc
return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in 
pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 111, in 
pandas._libs.index.IndexEngine.get_loc
KeyError: 'change'
{noformat}
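
One way to produce a friendlier message is to check that the filter matched something before any comparison is attempted. The Python sketch below is only an illustration: `select_suites` and `BenchmarkNotFoundError` are hypothetical names, not part of Archery, and treating the filter as a regular expression is an assumption.

```python
import re
from typing import Iterable, List


class BenchmarkNotFoundError(ValueError):
    """Raised when a suite filter matches no benchmark suite."""


def select_suites(available: Iterable[str], suite_filter: str) -> List[str]:
    """Return the suites matching the filter, or fail with a readable message."""
    available = list(available)
    pattern = re.compile(suite_filter)
    matched = [name for name in available if pattern.search(name)]
    if not matched:
        listing = ", ".join(sorted(available)) or "(none)"
        raise BenchmarkNotFoundError(
            f"No benchmark suite matches --suite-filter={suite_filter!r}. "
            f"Available suites: {listing}"
        )
    return matched
```

Failing early with the list of available suites points the user at the typo, instead of surfacing a `KeyError` from an empty comparison table.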









[jira] [Commented] (ARROW-14965) [Python][C++] Contention when reading Parquet files with multi-threading

2021-12-03 Thread Nick Gates (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452843#comment-17452843
 ] 

Nick Gates commented on ARROW-14965:


Really appreciate you looking into this - I will work to extract a minimal 
reproduction of this with executable code rather than the pseudocode above.

> [Python][C++] Contention when reading Parquet files with multi-threading
> 
>
> Key: ARROW-14965
> URL: https://issues.apache.org/jira/browse/ARROW-14965
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 6.0.0
>Reporter: Nick Gates
>Priority: Minor
>
> I'm attempting to read a table from multiple Parquet files where I already 
> know which row_groups I want to read from each file. I also want to apply a 
> filter expression while reading. To do this my code looks roughly like this:
>  
> {code:java}
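> from concurrent.futures import ThreadPoolExecutor
>
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> def read_file(filepath):
>     # "..." marks arguments elided in the original report
>     parquet_format = ds.ParquetFileFormat(...)
>     fragment = parquet_format.make_fragment(filepath, row_groups=[0, 1, 2, ...])
>     scanner = ds.Scanner.from_fragment(
>         fragment,
>         use_threads=True,
>         use_async=False,
>         filter=...
>     )
>     return scanner.to_reader().read_all()
>
> # Read all files concurrently and concatenate the resulting tables
> with ThreadPoolExecutor() as pool:
>     pa.concat_tables(pool.map(read_file, file_paths)) {code}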
> Running with a ProcessPoolExecutor, each of my 13 read_file calls takes at 
> most 2 seconds. However, with a ThreadPoolExecutor some of the read_file 
> calls take 20+ seconds.
>  
> I've tried running this with various combinations of use_threads and 
> use_async to try and see what's happening. The code blocks are sourced from 
> py-spy, and identifying contention was done with viztracer.
>  
> *use_threads: False, use_async: False*
>  * It looks like pyarrow._dataset.Scanner.to_reader doesn't release the GIL: 
> [https://github.com/apache/arrow/blob/be9a22b9b76d9cd83d85d52ffc2844056d90f367/python/pyarrow/_dataset.pyx#L3278-L3283]
>  * pyarrow._dataset.from_fragment seems to be contended. Py-spy suggests this 
> is around getting the physical_schema from the fragment?
>  
> {code:java}
> from_fragment (pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so)
> __pyx_getprop_7pyarrow_8_dataset_8Fragment_physical_schema 
> (pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so)
> __pthread_cond_timedwait (libpthread-2.17.so) {code}
>  
> *use_threads: False, use_async: True*
>  * There's no longer any contention for pyarrow._dataset.from_fragment
>  * But there's lots of contention for pyarrow.lib.RecordBatchReader.read_all
>  
> {code:java}
> arrow::RecordBatchReader::ReadAll (pyarrow/libarrow.so.600)
> arrow::dataset::(anonymous namespace)::ScannerRecordBatchReader::ReadNext 
> (pyarrow/libarrow_dataset.so.600)
> arrow::Iterator::Next
>  > (pyarrow/libarrow_dataset.so.600)
> arrow::FutureImpl::Wait (pyarrow/libarrow.so.600) 
> std::condition_variable::wait (libstdc++.so.6.0.19){code}
> *use_threads: True, use_async: False*
>  * Appears to be some contention on Scanner.to_reader
>  * But most contention remains for RecordBatchReader.read_all
> {code:java}
> arrow::RecordBatchReader::ReadAll (pyarrow/libarrow.so.600)
> arrow::dataset::(anonymous namespace)::ScannerRecordBatchReader::ReadNext 
> (pyarrow/libarrow_dataset.so.600)
> arrow::Iterator::Next  
> namespace)::SyncScanner::ScanBatches(arrow::Iterator
>  >)::{lambda()#1}, arrow::dataset::TaggedRecordBatch> > 
> (pyarrow/libarrow_dataset.so.600)
> std::condition_variable::wait (libstdc++.so.6.0.19)
> __pthread_cond_wait (libpthread-2.17.so) {code}
> *use_threads: True, use_async: True*
>  * Contention again mostly for RecordBatchReader.read_all, but seems to 
> complete in ~12 seconds rather than 20
> {code:java}
> arrow::RecordBatchReader::ReadAll (pyarrow/libarrow.so.600)
> arrow::dataset::(anonymous namespace)::ScannerRecordBatchReader::ReadNext 
> (pyarrow/libarrow_dataset.so.600)
> arrow::Iterator::Next
>  > (pyarrow/libarrow_dataset.so.600)
> arrow::FutureImpl::Wait (pyarrow/libarrow.so.600)
> std::condition_variable::wait (libstdc++.so.6.0.19)
> __pthread_cond_wait (libpthread-2.17.so) {code}
> Is this expected behaviour? Or should it be possible to achieve the same 
> performance from multi-threading as from multi-processing?
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-14975) [C++][Docs] Typo in in emit_dictionary_deltas documentation

2021-12-03 Thread Zixi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zixi Wang updated ARROW-14975:
--
External issue URL: https://github.com/apache/arrow/pull/11848

> [C++][Docs] Typo in in emit_dictionary_deltas documentation
> ---
>
> Key: ARROW-14975
> URL: https://issues.apache.org/jira/browse/ARROW-14975
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Zixi Wang
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 6.0.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Typo in \arrow\cpp\src\arrow\ipc\options.h, in my understanding, it should be 
> emitted, not omitted.





[jira] [Updated] (ARROW-14975) [C++][Docs] Typo in in emit_dictionary_deltas documentation

2021-12-03 Thread Zixi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zixi Wang updated ARROW-14975:
--
Labels: pull-request-available  (was: )

> [C++][Docs] Typo in in emit_dictionary_deltas documentation
> ---
>
> Key: ARROW-14975
> URL: https://issues.apache.org/jira/browse/ARROW-14975
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Zixi Wang
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 6.0.1
>
>
> Typo in \arrow\cpp\src\arrow\ipc\options.h, in my understanding, it should be 
> emitted, not omitted.





[jira] [Updated] (ARROW-14975) [C++][Docs] Typo in in emit_dictionary_deltas documentation

2021-12-03 Thread Zixi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zixi Wang updated ARROW-14975:
--
Issue Type: Improvement  (was: Bug)

> [C++][Docs] Typo in in emit_dictionary_deltas documentation
> ---
>
> Key: ARROW-14975
> URL: https://issues.apache.org/jira/browse/ARROW-14975
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Zixi Wang
>Priority: Trivial
> Fix For: 6.0.1
>
>
> Typo in \arrow\cpp\src\arrow\ipc\options.h, in my understanding, it should be 
> emitted, not omitted.





[jira] [Created] (ARROW-14975) [C++][Docs] Typo in in emit_dictionary_deltas documentation

2021-12-03 Thread Zixi Wang (Jira)
Zixi Wang created ARROW-14975:
-

 Summary: [C++][Docs] Typo in in emit_dictionary_deltas 
documentation
 Key: ARROW-14975
 URL: https://issues.apache.org/jira/browse/ARROW-14975
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Documentation
Reporter: Zixi Wang
 Fix For: 6.0.1


Typo in \arrow\cpp\src\arrow\ipc\options.h, in my understanding, it should be 
emitted, not omitted.





[jira] [Assigned] (ARROW-14737) [C++][Dataset] Support URI-decoding partition keys

2021-12-03 Thread Akhil (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akhil reassigned ARROW-14737:
-

Assignee: Akhil

> [C++][Dataset] Support URI-decoding partition keys
> --
>
> Key: ARROW-14737
> URL: https://issues.apache.org/jira/browse/ARROW-14737
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: Akhil
>Priority: Major
>  Labels: dataset, good-first-issue
>
> From [GitHub issue #11718|https://github.com/apache/arrow/issues/11718], 
> Delta Lake can URL-encode partition keys. We should add an option as we did 
> in ARROW-12644 to decode them as well.
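
What decoding would mean for a hive-style path can be shown with the Python standard library. This sketch is not the Arrow dataset API; `parse_hive_partitions` and its `decode` flag are hypothetical, and the path is made up for the example.

```python
from typing import Dict
from urllib.parse import unquote


def parse_hive_partitions(path: str, decode: bool = True) -> Dict[str, str]:
    """Extract key=value segments from a path, optionally URI-decoding them."""
    partitions: Dict[str, str] = {}
    for segment in path.split("/"):
        key, separator, value = segment.partition("=")
        if not separator:
            continue  # not a key=value segment (e.g. a directory or file name)
        if decode:
            # Decode both sides: the key and the value may be percent-encoded.
            key, value = unquote(key), unquote(value)
        partitions[key] = value
    return partitions


path = "data/event%20date=2021-12-03/city=S%C3%A3o%20Paulo/part-0.parquet"
print(parse_hive_partitions(path))
# {'event date': '2021-12-03', 'city': 'São Paulo'}
```

Keeping decoding behind a flag mirrors the request for an option, so paths that legitimately contain `%` sequences can still be read verbatim.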


