[jira] [Commented] (ARROW-14982) [C++][Python] Create utils for deep-copying an Array/ ArrayData
[ https://issues.apache.org/jira/browse/ARROW-14982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453258#comment-17453258 ]

David Li commented on ARROW-14982:
----------------------------------

We could perhaps wrap Concatenate in a kernel and name it deep_copy. (I think Concatenate on a single array suffices to copy the data.)

> [C++][Python] Create utils for deep-copying an Array/ ArrayData
> ---------------------------------------------------------------
>
>                 Key: ARROW-14982
>                 URL: https://issues.apache.org/jira/browse/ARROW-14982
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>            Reporter: Niranda Perera
>            Priority: Major
>
> Hi, I would like to request a util to deep copy an Array with the following semantic.
>
> {code:java}
> Result> DeepCopyArrayData(const ArrayData& arr, int64 offset = 0, int64 length = -1/*, pool=...*/);
> {code}
>
> This was discussed some time back in Zulip [1].
>
> [1] [https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/create.20a.20deep.20copy.20of.20buffers]

--
This message was sent by Atlassian Jira (v8.20.1#820001)
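The requested semantics (an `offset`, plus a `length` where -1 means "to the end") can be illustrated with a minimal Python sketch. It uses the standard-library `array` module as a stand-in for an Arrow buffer. `deep_copy_slice` is a hypothetical name chosen for this illustration, not an Arrow API, and the bounds checks are an assumption about how such a utility might behave.

```python
from array import array


def deep_copy_slice(values: array, offset: int = 0, length: int = -1) -> array:
    """Return an independent copy of values[offset:offset + length].

    Hypothetical helper mirroring the proposed signature; a length of -1
    means "everything from offset to the end".
    """
    if offset < 0 or offset > len(values):
        raise ValueError("offset out of bounds")
    if length == -1:
        length = len(values) - offset
    if length < 0 or offset + length > len(values):
        raise ValueError("length out of bounds")
    # Slicing an array.array allocates new storage, so the result shares
    # no memory with the input.
    return values[offset:offset + length]


original = array("q", [10, 20, 30, 40])
copied = deep_copy_slice(original, offset=1)
copied[0] = -1  # mutating the copy leaves the original untouched
```

The point of the sketch is only the contract: the copy starts at element zero of fresh storage, so later writes to either side cannot affect the other.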
[jira] [Commented] (ARROW-13672) [C++] BinaryBuilder doesn't preserve passed in DataType
[ https://issues.apache.org/jira/browse/ARROW-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453257#comment-17453257 ]

David Li commented on ARROW-13672:
----------------------------------

[~supun] probably the builder should store the given type and use it in [FinishInternal|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L317] (and presumably in [type()|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L390]).

> [C++] BinaryBuilder doesn't preserve passed in DataType
> -------------------------------------------------------
>
>                 Key: ARROW-13672
>                 URL: https://issues.apache.org/jira/browse/ARROW-13672
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 5.0.0
>            Reporter: Micah Kornfield
>            Assignee: Supun Kamburugamuva
>            Priority: Minor
>              Labels: beginner, good-first-issue
>
> There is a [constructor|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L56] that takes a datatype for binary builder but it is discarded. When constructing an Array the type is always the value returned from type() [binary|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L390]
> If a consumer of the API wants to have an extension array this prevents them from passing the extension type through.
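The fix suggested in this comment (store the type given to the constructor, then use it when finishing) might look roughly like the toy Python sketch below. `ToyBinaryBuilder` and its dictionary "array" are illustrative stand-ins invented for this note; they are not the Arrow C++ classes and say nothing about how the real patch was written.

```python
class ToyBinaryBuilder:
    """Toy stand-in for a binary builder; not the Arrow C++ class."""

    DEFAULT_TYPE = "binary"

    def __init__(self, data_type=None):
        # Keep the caller's type instead of discarding it.
        self._type = data_type if data_type is not None else self.DEFAULT_TYPE
        self._values = []

    def type(self):
        # Report the stored type rather than a hard-coded default.
        return self._type

    def append(self, value: bytes) -> None:
        self._values.append(value)

    def finish(self):
        # The finished "array" carries the stored type, so an extension
        # type passed to the constructor survives.
        return {"type": self._type, "values": list(self._values)}


builder = ToyBinaryBuilder("my_extension_type")
builder.append(b"abc")
result = builder.finish()
```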
[jira] [Updated] (ARROW-14982) [C++][Python] Create utils for deep-copying an Array/ ArrayData
[ https://issues.apache.org/jira/browse/ARROW-14982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niranda Perera updated ARROW-14982:
-----------------------------------
    Description:
Hi, I would like to request a util to deep copy an Array with the following semantic.

{code:java}
Result> DeepCopyArrayData(const ArrayData& arr, int64 offset = 0, int64 length = -1/*, pool=...*/);
{code}

This was discussed some time back in Zulip [1].

[1] [https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/create.20a.20deep.20copy.20of.20buffers]

  was:
Hi, I would like to request a util to deep copy an Array with the following semantic.

{code:java}
Result> DeepCopyArrayData(const ArrayData& arr, int64 offset = 0, int64 length = arr->length()/*, pool=...*/);
{code}

This was discussed some time back in Zulip [1].

[1] https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/create.20a.20deep.20copy.20of.20buffers

> [C++][Python] Create utils for deep-copying an Array/ ArrayData
> ---------------------------------------------------------------
>
>                 Key: ARROW-14982
>                 URL: https://issues.apache.org/jira/browse/ARROW-14982
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>            Reporter: Niranda Perera
>            Priority: Major
>
> Hi, I would like to request a util to deep copy an Array with the following semantic.
>
> {code:java}
> Result> DeepCopyArrayData(const ArrayData& arr, int64 offset = 0, int64 length = -1/*, pool=...*/);
> {code}
>
> This was discussed some time back in Zulip [1].
>
> [1] [https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/create.20a.20deep.20copy.20of.20buffers]
[jira] [Created] (ARROW-14982) [C++][Python] Create utils for deep-copying an Array/ ArrayData
Niranda Perera created ARROW-14982:
--------------------------------------

             Summary: [C++][Python] Create utils for deep-copying an Array/ ArrayData
                 Key: ARROW-14982
                 URL: https://issues.apache.org/jira/browse/ARROW-14982
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++, Python
            Reporter: Niranda Perera

Hi, I would like to request a util to deep copy an Array with the following semantic.

{code:java}
Result> DeepCopyArrayData(const ArrayData& arr, int64 offset = 0, int64 length = arr->length()/*, pool=...*/);
{code}

This was discussed some time back in Zulip [1].

[1] https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/create.20a.20deep.20copy.20of.20buffers
[jira] [Assigned] (ARROW-13672) [C++] BinaryBuilder doesn't preserve passed in DataType
[ https://issues.apache.org/jira/browse/ARROW-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Supun Kamburugamuva reassigned ARROW-13672:
-------------------------------------------

    Assignee: Supun Kamburugamuva

> [C++] BinaryBuilder doesn't preserve passed in DataType
> -------------------------------------------------------
>
>                 Key: ARROW-13672
>                 URL: https://issues.apache.org/jira/browse/ARROW-13672
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 5.0.0
>            Reporter: Micah Kornfield
>            Assignee: Supun Kamburugamuva
>            Priority: Minor
>              Labels: beginner, good-first-issue
>
> There is a [constructor|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L56] that takes a datatype for binary builder but it is discarded. When constructing an Array the type is always the value returned from type() [binary|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L390]
> If a consumer of the API wants to have an extension array this prevents them from passing the extension type through.
[jira] [Commented] (ARROW-13672) [C++] BinaryBuilder doesn't preserve passed in DataType
[ https://issues.apache.org/jira/browse/ARROW-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453217#comment-17453217 ]

Supun Kamburugamuva commented on ARROW-13672:
---------------------------------------------

Should the solution be that we remove passing the type to the constructor?

> [C++] BinaryBuilder doesn't preserve passed in DataType
> -------------------------------------------------------
>
>                 Key: ARROW-13672
>                 URL: https://issues.apache.org/jira/browse/ARROW-13672
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 5.0.0
>            Reporter: Micah Kornfield
>            Priority: Minor
>              Labels: beginner, good-first-issue
>
> There is a [constructor|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L56] that takes a datatype for binary builder but it is discarded. When constructing an Array the type is always the value returned from type() [binary|https://github.com/apache/arrow/blob/1430c93f68960e10a50d27f465eb174e76ac06b2/cpp/src/arrow/array/builder_binary.h#L390]
> If a consumer of the API wants to have an extension array this prevents them from passing the extension type through.
[jira] [Updated] (ARROW-14981) [CI][Docs] Upload built documents
[ https://issues.apache.org/jira/browse/ARROW-14981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14981: --- Labels: pull-request-available (was: ) > [CI][Docs] Upload built documents > - > > Key: ARROW-14981 > URL: https://issues.apache.org/jira/browse/ARROW-14981 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Documentation >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > We can use this in release process instead of building on release manager's > local environment. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14981) [CI][Docs] Upload built documents
Kouhei Sutou created ARROW-14981: Summary: [CI][Docs] Upload built documents Key: ARROW-14981 URL: https://issues.apache.org/jira/browse/ARROW-14981 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Documentation Reporter: Kouhei Sutou Assignee: Kouhei Sutou We can use this in release process instead of building on release manager's local environment. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13735) [Python] Creating a Map array with non-default field names segfaults
[ https://issues.apache.org/jira/browse/ARROW-13735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13735: --- Labels: pull-request-available (was: ) > [Python] Creating a Map array with non-default field names segfaults > > > Key: ARROW-13735 > URL: https://issues.apache.org/jira/browse/ARROW-13735 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Will Jones >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > With ARROW-13696, you can create a MapType with non-default field names (the > default being "key" and "value"). > However, when then trying to create an array with it from python tuples, it > crashes: > {code:python} > >>> t = pa.map_(pa.field("name", "string", nullable=False), "int64") > >>> pa.array([[('a', 1), ('b', 2)], [('c', 3)]], type=t) > ../src/arrow/array/array_nested.cc:192: Check failed: > self->list_type_->value_type()->Equals(data->child_data[0]->type) > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b882)[0x7f298d497882] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b800)[0x7f298d497800] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xf0b822)[0x7f298d497822] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow4util8ArrowLogD1Ev+0x47)[0x7f298d497b81] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xb39d31)[0x7f298d0c5d31] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow8MapArray7SetDataERKSt10shared_ptrINS_9ArrayDataEE+0x198)[0x7f298d0c06be] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow8MapArrayC1ERKSt10shared_ptrINS_9ArrayDataEE+0x64)[0x7f298d0bed14] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN9__gnu_cxx13new_allocatorIN5arrow8MapArrayEE9constructIS2_JRKSt10shared_ptrINS1_9ArrayDataEvPT_DpOT0_+0x49)[0x7f298d1a0f13] > 
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt16allocator_traitsISaIN5arrow8MapArrayEEE9constructIS1_JRKSt10shared_ptrINS0_9ArrayDataEvRS2_PT_DpOT0_+0x38)[0x7f298d19ebe6] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt23_Sp_counted_ptr_inplaceIN5arrow8MapArrayESaIS1_ELN9__gnu_cxx12_Lock_policyE2EEC1IJRKSt10shared_ptrINS0_9ArrayDataES2_DpOT_+0xaf)[0x7f298d19b547] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2IN5arrow8MapArrayESaIS5_EJRKSt10shared_ptrINS4_9ArrayDataERPT_St20_Sp_alloc_shared_tagIT0_EDpOT1_+0xb2)[0x7f298d195a64] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt12__shared_ptrIN5arrow8MapArrayELN9__gnu_cxx12_Lock_policyE2EEC2ISaIS1_EJRKSt10shared_ptrINS0_9ArrayDataESt20_Sp_alloc_shared_tagIT_EDpOT0_+0x4c)[0x7f298d1918bc] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZNSt10shared_ptrIN5arrow8MapArrayEEC1ISaIS1_EJRKS_INS0_9ArrayDataESt20_Sp_alloc_shared_tagIT_EDpOT0_+0x39)[0x7f298d18f617] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZSt15allocate_sharedIN5arrow8MapArrayESaIS1_EJRKSt10shared_ptrINS0_9ArrayDataS3_IT_ERKT0_DpOT1_+0x38)[0x7f298d18d254] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZSt11make_sharedIN5arrow8MapArrayEJRKSt10shared_ptrINS0_9ArrayDataS2_IT_EDpOT0_+0x54)[0x7f298d1897b7] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xbf5d6a)[0x7f298d181d6a] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(+0xbef0f3)[0x7f298d17b0f3] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x99)[0x7f298d173f6b] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow12ArrayBuilder6FinishEPSt10shared_ptrINS_5ArrayEE+0x115)[0x7f298d0e4ed9] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.600(_ZN5arrow12ArrayBuilder6FinishEv+0x47)[0x7f298d0e4fb7] > 
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x28cc91)[0x7f29d05d2c91] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x292774)[0x7f29d05d8774] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x28ca00)[0x7f29d05d2a00] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(+0x288f63)[0x7f29d05cef63] > /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_python.so.600(_ZN5arrow2py17ConvertPySequenceEP7_objectS2_NS0_19PyConversionOptionsEPNS_10MemoryPoolE+0xa9d)[0x7f29d05cadb7] > /home/joris/scipy/repos/arrow/python/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c890d)[0x7f29d08f190d] > /home/joris/miniconda3/envs/arrow-dev/bin/python(PyCFunction_Call+0x54)[0x5581d331a814] > /home/joris/miniconda
[jira] [Updated] (ARROW-12629) [C++] Configurable read-ahead in CSV and JSON readers
[ https://issues.apache.org/jira/browse/ARROW-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-12629:
-----------------------------------
    Labels: good-first-issue pull-request-available  (was: good-first-issue)

> [C++] Configurable read-ahead in CSV and JSON readers
> -----------------------------------------------------
>
>                 Key: ARROW-12629
>                 URL: https://issues.apache.org/jira/browse/ARROW-12629
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Andre Kohn
>            Assignee: Supun Kamburugamuva
>            Priority: Major
>              Labels: good-first-issue, pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We are compiling Arrow C++ to WebAssembly and ran into the following issue with the CSV reader:
> Browsers became very picky about the use of SharedArrayBuffers after the events around Spectre and Meltdown.
> As a result, you have to compile Arrow to WebAssembly without threads if you don't want to run your website with very strict cross-origin isolation.
> Unfortunately, the CSV reader seems to always spawn a thread for the read-ahead in both the SerialStreamingReader and the SerialTableReader, independent of whether use_threads is set.
> Right now, this effectively means that you cannot use the CSV (and JSON) readers in threadless WebAssembly builds.
>
> [https://github.com/apache/arrow/blob/4363fefe46dc357a9013f0f4bcdc235e1e2e8124/cpp/src/arrow/csv/reader.cc#L839]
> [https://github.com/apache/arrow/blob/4363fefe46dc357a9013f0f4bcdc235e1e2e8124/cpp/src/arrow/csv/reader.cc#L913]
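As a rough illustration of what "configurable read-ahead" could mean, the Python sketch below only spawns a background thread when `use_threads` is true and otherwise pulls blocks on the calling thread. `read_ahead` and its `depth` parameter are hypothetical names for this note; the Arrow C++ readers are structured differently, and error propagation from the producer thread is deliberately left out.

```python
import queue
import threading
from typing import Iterable, Iterator


def read_ahead(blocks: Iterable[bytes], use_threads: bool, depth: int = 2) -> Iterator[bytes]:
    """Yield blocks, prefetching up to `depth` of them only if threads are allowed."""
    if not use_threads:
        # Threadless path: pull blocks lazily on the calling thread.
        yield from blocks
        return

    done = object()  # sentinel marking the end of the stream
    buffered: queue.Queue = queue.Queue(maxsize=depth)

    def producer() -> None:
        # Runs in the background and blocks once `depth` items are waiting.
        for block in blocks:
            buffered.put(block)
        buffered.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffered.get()
        if item is done:
            return
        yield item


serial = list(read_ahead([b"a,b\n", b"1,2\n"], use_threads=False))
threaded = list(read_ahead([b"a,b\n", b"1,2\n"], use_threads=True))
```

Both paths yield the same blocks in the same order; only the threaded one ever creates a thread, which is the property a threadless WebAssembly build would need.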
[jira] [Resolved] (ARROW-14879) [Python][Packaging] Remove manylinux2010 wheels
[ https://issues.apache.org/jira/browse/ARROW-14879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Cook resolved ARROW-14879.
------------------------------
    Fix Version/s:     (was: 7.0.0)
       Resolution: Won't Fix

Closing as "won't fix" for now. Quoting [~kszucs]:
{quote}I managed to [patch[elf] the vcpkg binary with glibc 2.18|https://github.com/apache/arrow/pull/11569/files#diff-6f627de1c3985ba6a76addecbcb7b3cb88ca0ca6244d91339a348cf4d48db914R52-R53] in the manylinux2010 image, so we should be able to maintain the builds until the vcpkg dependencies compile. Closing in favor of [{{2233ac5}}|https://github.com/apache/arrow/commit/2233ac5782e52015d1f51ac3f7dd201c1262a947]
{quote}

> [Python][Packaging] Remove manylinux2010 wheels
> -----------------------------------------------
>
>                 Key: ARROW-14879
>                 URL: https://issues.apache.org/jira/browse/ARROW-14879
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Packaging, Python
>            Reporter: Krisztian Szucs
>            Assignee: Krisztian Szucs
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> More recent vcpkg is not compatible with older glibc shipped by manylinux2010 so we won't be able to regularly update the dependencies. Besides that manylinux2010 has reached EOL.
[jira] [Commented] (ARROW-13035) [C++] Create a compute function returning indices of non-zero values
[ https://issues.apache.org/jira/browse/ARROW-13035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453190#comment-17453190 ]

Niranda Perera commented on ARROW-13035:
----------------------------------------

[~amol-] Shall I take it up? I think it is fairly straightforward.

> [C++] Create a compute function returning indices of non-zero values
> --------------------------------------------------------------------
>
>                 Key: ARROW-13035
>                 URL: https://issues.apache.org/jira/browse/ARROW-13035
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Antoine Pitrou
>            Assignee: Alessandro Molina
>            Priority: Major
>              Labels: good-first-issue
>
> This would be similar to Numpy's {{nonzero}} function:
> [https://numpy.org/doc/stable/reference/generated/numpy.nonzero.html]
> {code:python}
> >>> arr = np.array([4,5,0,6,0,5])
> >>> np.nonzero(arr)
> (array([0, 1, 3, 5]),)
> {code}
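For readers without NumPy at hand, the behaviour being requested can be sketched in plain Python. `nonzero_indices` is a hypothetical name, and skipping nulls (`None`) is an assumption made for this sketch; the issue does not say how the compute function should treat them.

```python
def nonzero_indices(values):
    """Return the positions of all entries that are neither null nor zero."""
    # Assumption: nulls (None) are skipped rather than reported.
    return [i for i, v in enumerate(values) if v is not None and v != 0]


indices = nonzero_indices([4, 5, 0, 6, 0, 5])  # same input as the NumPy example
```

On the input from the issue this yields `[0, 1, 3, 5]`, the same positions NumPy reports.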
[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453188#comment-17453188 ]

Joris Van den Bossche commented on ARROW-13168:
-----------------------------------------------

The same is the case for python: https://pypi.org/project/tzdata/

Ideally we would be able to specify the path at runtime, I think.

> [C++] Timezone database configuration and access
> ------------------------------------------------
>
>                 Key: ARROW-13168
>                 URL: https://issues.apache.org/jira/browse/ARROW-13168
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Rok Mihevc
>            Priority: Major
>              Labels: timestamp
>
> Note: currently timezone database is not available on windows so timezone aware operations will fail.
> We're using tz.h library which needs an updated timezone database to correctly handle timezoned timestamps. See [installation instructions|https://howardhinnant.github.io/date/tz.html#Installation].
> We have the following options for getting a timezone database:
> # local (non-windows) OS timezone database - no work required.
> # arrow bundled folder - we could bundle the database at build time for windows. Database would slowly go stale.
> # download it from IANA Time Zone Database at runtime - tz.h gets the database at runtime, but curl (and 7-zip on windows) are required.
> # local user-provided folder - user could provide a location at buildtime. Nice to have.
> # allow runtime configuration - at runtime say: "the tzdata can be found at this location"
> For more context see: [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data].
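One way to picture the "specify the path at runtime" option is a lookup that prefers a user-supplied directory and falls back to an OS location. The Python sketch below is only an illustration under that assumption: `resolve_tzdata_dir` is a hypothetical helper, and `/usr/share/zoneinfo` is used as a typical non-Windows default rather than anything Arrow is known to check.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def resolve_tzdata_dir(user_dir: Optional[str] = None,
                       fallbacks: Iterable[str] = ("/usr/share/zoneinfo",)) -> Optional[Path]:
    """Return the first existing directory among user_dir and the fallbacks."""
    candidates = []
    if user_dir is not None:
        # A runtime-configured location takes priority over OS defaults.
        candidates.append(user_dir)
    candidates.extend(fallbacks)
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None  # caller decides whether a missing database is an error


# Example: the current directory stands in for a user-provided tzdata folder.
chosen = resolve_tzdata_dir(os.getcwd())
```

Returning `None` instead of raising keeps the policy decision (fail, download, or use a bundled copy) with the caller, which matches the open question in the thread.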
[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453186#comment-17453186 ] Dewey Dunnington commented on ARROW-13168: -- We can force it to be present at build-time, force it to be present at install time or force it to be present at runtime (probably what we'd do). > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: timestamp > > Note: currently timezone database is not available on windows so timezone > aware operations will fail. > We're using tz.h library which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > windows. Database would slowly go stale. > # download it from IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on windows) are required. > # local user-provided folder - user could provide a location at buildtime. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14961) [C++] Bump version on Google Benchmark
[ https://issues.apache.org/jira/browse/ARROW-14961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook resolved ARROW-14961. -- Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11833 [https://github.com/apache/arrow/pull/11833] > [C++] Bump version on Google Benchmark > -- > > Key: ARROW-14961 > URL: https://issues.apache.org/jira/browse/ARROW-14961 > Project: Apache Arrow > Issue Type: Task > Components: Benchmarking, C++ >Reporter: Sasha Krassovsky >Assignee: Sasha Krassovsky >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > Google Benchmark v1.6.0 came out - I'd like to use a couple of functions it > provides in a different issue. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-1699) [C++] Forward, backward fill kernel functions
[ https://issues.apache.org/jira/browse/ARROW-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1699: -- Labels: analytics dataframe pull-request-available (was: analytics dataframe) > [C++] Forward, backward fill kernel functions > - > > Key: ARROW-1699 > URL: https://issues.apache.org/jira/browse/ARROW-1699 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Alvin Chunga Mamani >Priority: Major > Labels: analytics, dataframe, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Like ffill / bfill in pandas (with limit) -- This message was sent by Atlassian Jira (v8.20.1#820001)
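The pandas-style behaviour referenced in the issue (fill nulls from the previous or next valid value, with `limit` capping how many consecutive nulls get filled) can be sketched in plain Python. `forward_fill` and `backward_fill` are hypothetical names for this illustration, with `None` standing in for a null; this is not the kernel from the pull request.

```python
def forward_fill(values, limit=None):
    """Replace each None with the last valid value, at most `limit` times in a row."""
    result = []
    last = None   # most recent non-null value seen so far
    filled = 0    # consecutive nulls filled since that value
    for v in values:
        if v is not None:
            last, filled = v, 0
            result.append(v)
        elif last is not None and (limit is None or filled < limit):
            filled += 1
            result.append(last)
        else:
            # No earlier value to copy, or the limit is exhausted.
            result.append(None)
    return result


def backward_fill(values, limit=None):
    """Backward fill is a forward fill over the reversed sequence."""
    return forward_fill(list(values)[::-1], limit)[::-1]


ffilled = forward_fill([1, None, None, None, 5, None], limit=2)
bfilled = backward_fill([None, None, 3, None], limit=1)
```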
[jira] [Updated] (ARROW-14979) [C++] GCS integration tests leak processes
[ https://issues.apache.org/jira/browse/ARROW-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14979: --- Labels: pull-request-available (was: ) > [C++] GCS integration tests leak processes > -- > > Key: ARROW-14979 > URL: https://issues.apache.org/jira/browse/ARROW-14979 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Carlos O'Ryan >Assignee: Carlos O'Ryan >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The GCS integration tests fail to fully shutdown all the processes created to > run the GCS testbench. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14980) [C++] GcsFileSystem tests should use PYTHON environment variable
Carlos O'Ryan created ARROW-14980: - Summary: [C++] GcsFileSystem tests should use PYTHON environment variable Key: ARROW-14980 URL: https://issues.apache.org/jira/browse/ARROW-14980 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Carlos O'Ryan -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14506) [C++] Add GCS library to conda files
[ https://issues.apache.org/jira/browse/ARROW-14506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14506: --- Labels: pull-request-available (was: ) > [C++] Add GCS library to conda files > > > Key: ARROW-14506 > URL: https://issues.apache.org/jira/browse/ARROW-14506 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Carlos O'Ryan >Assignee: Carlos O'Ryan >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > It seems that the conda package for `google-cloud-cpp` is not usable: > https://github.com/conda-forge/google-cloud-cpp-feedstock/issues/68 > this is a reminder to add the dependency once the package is fixed. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14979) [C++] GCS integration tests leak processes
[ https://issues.apache.org/jira/browse/ARROW-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos O'Ryan reassigned ARROW-14979: - Assignee: Carlos O'Ryan > [C++] GCS integration tests leak processes > -- > > Key: ARROW-14979 > URL: https://issues.apache.org/jira/browse/ARROW-14979 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Carlos O'Ryan >Assignee: Carlos O'Ryan >Priority: Major > > The GCS integration tests fail to fully shutdown all the processes created to > run the GCS testbench. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14979) [C++] GCS integration tests leak processes
Carlos O'Ryan created ARROW-14979: - Summary: [C++] GCS integration tests leak processes Key: ARROW-14979 URL: https://issues.apache.org/jira/browse/ARROW-14979 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Carlos O'Ryan The GCS integration tests fail to fully shutdown all the processes created to run the GCS testbench. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14506) [C++] Add GCS library to conda files
[ https://issues.apache.org/jira/browse/ARROW-14506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos O'Ryan reassigned ARROW-14506: - Assignee: Carlos O'Ryan > [C++] Add GCS library to conda files > > > Key: ARROW-14506 > URL: https://issues.apache.org/jira/browse/ARROW-14506 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Carlos O'Ryan >Assignee: Carlos O'Ryan >Priority: Major > > It seems that the conda package for `google-cloud-cpp` is not usable: > https://github.com/conda-forge/google-cloud-cpp-feedstock/issues/68 > this is a reminder to add the dependency once the package is fixed. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14941) [R] Implement Duration R6 class and bindings for lubridate::duration()
[ https://issues.apache.org/jira/browse/ARROW-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14941: --- Labels: pull-request-available (was: ) > [R] Implement Duration R6 class and bindings for lubridate::duration() > -- > > Key: ARROW-14941 > URL: https://issues.apache.org/jira/browse/ARROW-14941 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453165#comment-17453165 ] Rok Mihevc commented on ARROW-13168: Ah, I see your point. Neat! Is tzdb package always present? > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: timestamp > > Note: currently timezone database is not available on windows so timezone > aware operations will fail. > We're using tz.h library which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > windows. Database would slowly go stale. > # download it from IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on windows) are required. > # local user-provided folder - user could provide a location at buildtime. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453161#comment-17453161 ] Dewey Dunnington commented on ARROW-13168: -- For sure you don't want to depend on it being present from C++! From our end on the R bindings, though, it means we can support Windows through the tzdb R package (if C++ lets us point at a directory with a database in the right format). > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: timestamp > > Note: currently timezone database is not available on windows so timezone > aware operations will fail. > We're using tz.h library which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > windows. Database would slowly go stale. > # download it from IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on windows) are required. > # local user-provided folder - user could provide a location at buildtime. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14978) Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1
[ https://issues.apache.org/jira/browse/ARROW-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Henry updated ARROW-14978:
-------------------------------
    Description:
*Description:*
Unable to import flight module in pyarrow in version 6.0.0 and 6.0.1. Is there a different package I can install that provides the module? Or is this potentially an issue with me using apple silicon?

*Error:*
>>>import pyarrow.flight
Traceback (most recent call last):
File "", line 1, in
File "/path/to/repo/pyarrow/flight.py", line 18, in
from pyarrow._flight import ( # noqa:F401
ModuleNotFoundError: No module named 'pyarrow._flight'

*How to replicate*
pip3.8 install pyarrow==6.0.1
python3.8 -c 'import pyarrow.flight'

*Python version:*
Python version: 3.8

*Environment*
MacBook Pro (Apple Silicon)

  was:
*Description:*
Unable to import flight module in pyarrow since version 6.0.0. Is there a different package I can install that provides the module? Or is this potentially an issue with me using apple silicon?

*Error:*
>>>import pyarrow.flight
Traceback (most recent call last):
File "", line 1, in
File "/path/to/repo/pyarrow/flight.py", line 18, in
from pyarrow._flight import ( # noqa:F401
ModuleNotFoundError: No module named 'pyarrow._flight'

*How to replicate*
pip3.8 install pyarrow==6.0.1
python3.8 -c 'import pyarrow.flight'

*Python version:*
Python version: 3.8

*Environment*
MacBook Pro (Apple Silicon)

> Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1
> ----------------------------------------------------------
>
>                 Key: ARROW-14978
>                 URL: https://issues.apache.org/jira/browse/ARROW-14978
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 6.0.0, 6.0.1
>            Reporter: Eric Henry
>            Priority: Major
>
> *Description:*
> Unable to import flight module in pyarrow in version 6.0.0 and 6.0.1. Is there a different package I can install that provides the module? Or is this potentially an issue with me using apple silicon?
>
> *Error:*
> >>>import pyarrow.flight
> Traceback (most recent call last):
> File "", line 1, in
> File "/path/to/repo/pyarrow/flight.py", line 18, in
> from pyarrow._flight import ( # noqa:F401
> ModuleNotFoundError: No module named 'pyarrow._flight'
>
> *How to replicate*
> pip3.8 install pyarrow==6.0.1
> python3.8 -c 'import pyarrow.flight'
>
> *Python version:*
> Python version: 3.8
>
> *Environment*
> MacBook Pro (Apple Silicon)
[jira] [Updated] (ARROW-14978) Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1
[ https://issues.apache.org/jira/browse/ARROW-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Henry updated ARROW-14978: --- Description: *Description:* Unable to import flight module in pyarrow since version 6.0.0. Is there a different package I can install that provides the module? Or is this potentially an issue with me using apple silicon? *Error:* >>>import pyarrow.flight Traceback (most recent call last): File "", line 1, in File "/path/to/repo/pyarrow/flight.py", line 18, in from pyarrow._flight import ( # noqa:F401 ModuleNotFoundError: No module named 'pyarrow._flight' *How to replicate* pip3.8 install pyarrow==6.0.1 python3.8 -c 'import pyarrow.flight' *Python version:* Python version: 3.8 *Environment* MacBook Pro (Apple Silicon) was: *Description:* Unable to import flight module in pyarrow since version 6.0.0. This has been preventing me from upgrading my version of pyarrow. Will this no longer be included in the wheel? If so, is there a Jira ticket or announcement so I can get more information? Is there a different package I can install that provides the module? Or is this potentially an issue with me using apple silicon? *Error:* >>>import pyarrow.flight Traceback (most recent call last): File "", line 1, in File "/path/to/repo/pyarrow/flight.py", line 18, in from pyarrow._flight import ( # noqa:F401 ModuleNotFoundError: No module named 'pyarrow._flight' *How to replicate* pip3.8 install pyarrow==6.0.1 python3.8 -c 'import pyarrow.flight' *Python version:* Python version: 3.8 *Environment* MacBook Pro (Apple Silicon) > Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1 > -- > > Key: ARROW-14978 > URL: https://issues.apache.org/jira/browse/ARROW-14978 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 6.0.0, 6.0.1 >Reporter: Eric Henry >Priority: Major > > *Description:* > Unable to import flight module in pyarrow since version 6.0.0. Is there a > different package I can install that provides the module? 
Or is this > potentially an issue with me using apple silicon? > > *Error:* > >>>import pyarrow.flight > Traceback (most recent call last): > File "", line 1, in > File "/path/to/repo/pyarrow/flight.py", line 18, in > from pyarrow._flight import ( # noqa:F401 > ModuleNotFoundError: No module named 'pyarrow._flight' > > *How to replicate* > pip3.8 install pyarrow==6.0.1 > python3.8 -c 'import pyarrow.flight' > > *Python version:* > Python version: 3.8 > > *Environment* > MacBook Pro (Apple Silicon) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14903) [C++] Enable CSV Writer to control string to be used for missing data
[ https://issues.apache.org/jira/browse/ARROW-14903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-14903. -- Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11836 [https://github.com/apache/arrow/pull/11836] > [C++] Enable CSV Writer to control string to be used for missing data > - > > Key: ARROW-14903 > URL: https://issues.apache.org/jira/browse/ARROW-14903 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Johan Peltenburg >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 7.0.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > This would allow the user to control how missing values are written to a CSV > file using the R {{write_csv_arrow()}} functionality. > {{{}na{}}}: string used for missing values. Defaults to {{{}NA{}}}. Missing > values are never quoted; strings with the same value as {{na}} will always be > quoted. -- This message was sent by Atlassian Jira (v8.20.1#820001)
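The rule quoted in the ticket (missing values are never quoted; strings equal to the {{na}} string are always quoted) can be sketched in plain Python. This is only an illustration of the described behaviour, not the Arrow C++ writer; the helper names `format_field` and `format_row` are hypothetical, and the extra quoting for commas, quotes and newlines is an assumption added to make the sketch complete.

```python
def format_field(value, na="NA"):
    """Render one CSV field following the rule described in the ticket."""
    if value is None:
        # Missing values are written as the `na` string and are never quoted.
        return na
    text = str(value)
    # A real string equal to `na` must be quoted so it cannot be mistaken
    # for a missing value; fields with special characters are quoted too.
    needs_quotes = text == na or any(c in text for c in ',"\n\r')
    if needs_quotes:
        return '"' + text.replace('"', '""') + '"'
    return text


def format_row(values, na="NA"):
    """Join the formatted fields of one row with commas."""
    return ",".join(format_field(v, na) for v in values)


print(format_row([1, None, "NA", "a,b"]))
```

Quoting the literal string is what keeps a reader able to tell a genuine "NA" value apart from a null when the file is read back.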
[jira] [Updated] (ARROW-14978) Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1
[ https://issues.apache.org/jira/browse/ARROW-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Henry updated ARROW-14978: --- Description: *Description:* Unable to import flight module in pyarrow since version 6.0.0. This has been preventing me from upgrading my version of pyarrow. Will this no longer be included in the wheel? If so, is there a Jira ticket or announcement so I can get more information? Is there a different package I can install that provides the module? Or is this potentially an issue with me using apple silicon? *Error:* >>>import pyarrow.flight Traceback (most recent call last): File "", line 1, in File "/path/to/repo/pyarrow/flight.py", line 18, in from pyarrow._flight import ( # noqa:F401 ModuleNotFoundError: No module named 'pyarrow._flight' *How to replicate* pip3.8 install pyarrow==6.0.1 python3.8 -c 'import pyarrow.flight' *Python version:* Python version: 3.8 *Environment* MacBook Pro (Apple Silicon) was: *Description:* Unable to import flight module in pyarrow since version 6.0.0. This has been preventing me from upgrading my version of pyarrow. Will this no longer be included in the wheel? If so, is there a Jira ticket or announcement so I can get more information? Is there a different package I can install that provides the module? Or is this potentially an issue with me using apple silicon? 
*Error:* >>>import pyarrow.flight Traceback (most recent call last): File "", line 1, in File "/Users/erichenry/PycharmProjects/flying-pandas/venv/lib/python3.8/site-packages/pyarrow/flight.py", line 18, in from pyarrow._flight import ( # noqa:F401 ModuleNotFoundError: No module named 'pyarrow._flight' *How to replicate* pip3.8 install pyarrow==6.0.1 python3.8 -c 'import pyarrow.flight' *Python version:* Python version: 3.8 *Environment* MacBook Pro (Apple Silicon) > Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1 > -- > > Key: ARROW-14978 > URL: https://issues.apache.org/jira/browse/ARROW-14978 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 6.0.0, 6.0.1 >Reporter: Eric Henry >Priority: Major > > *Description:* > Unable to import flight module in pyarrow since version 6.0.0. This has been > preventing me from upgrading my version of pyarrow. Will this no longer be > included in the wheel? If so, is there a Jira ticket or announcement so I can > get more information? Is there a different package I can install that > provides the module? Or is this potentially an issue with me using apple > silicon? > > *Error:* > >>>import pyarrow.flight > Traceback (most recent call last): > File "", line 1, in > File "/path/to/repo/pyarrow/flight.py", line 18, in > from pyarrow._flight import ( # noqa:F401 > ModuleNotFoundError: No module named 'pyarrow._flight' > > *How to replicate* > pip3.8 install pyarrow==6.0.1 > python3.8 -c 'import pyarrow.flight' > > *Python version:* > Python version: 3.8 > > *Environment* > MacBook Pro (Apple Silicon) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14978) Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1
[ https://issues.apache.org/jira/browse/ARROW-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Henry updated ARROW-14978: --- Description: *Description:* Unable to import flight module in pyarrow since version 6.0.0. This has been preventing me from upgrading my version of pyarrow. Will this no longer be included in the wheel? If so, is there a Jira ticket or announcement so I can get more information? Is there a different package I can install that provides the module? Or is this potentially an issue with me using apple silicon? *Error:* >>>import pyarrow.flight Traceback (most recent call last): File "", line 1, in File "/Users/erichenry/PycharmProjects/flying-pandas/venv/lib/python3.8/site-packages/pyarrow/flight.py", line 18, in from pyarrow._flight import ( # noqa:F401 ModuleNotFoundError: No module named 'pyarrow._flight' *How to replicate* pip3.8 install pyarrow==6.0.1 python3.8 -c 'import pyarrow.flight' *Python version:* Python version: 3.8 *Environment* MacBook Pro (Apple Silicon) was: *Description:* Unable to import flight module in pyarrow since version 6.0.0. This has been preventing me from upgrading my version of pyarrow. Will this no longer be included in the wheel? If so, is there a Jira ticket or announcement so I can get more information? Is there a different package I can install that provides the module? 
*Error:* >>>import pyarrow.flight Traceback (most recent call last): File "", line 1, in File "/Users/erichenry/PycharmProjects/flying-pandas/venv/lib/python3.8/site-packages/pyarrow/flight.py", line 18, in from pyarrow._flight import ( # noqa:F401 ModuleNotFoundError: No module named 'pyarrow._flight' *How to replicate* pip3.8 install pyarrow==6.0.1 python3.8 -c 'import pyarrow.flight' *Python version:* Python version: 3.8 > Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1 > -- > > Key: ARROW-14978 > URL: https://issues.apache.org/jira/browse/ARROW-14978 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 6.0.0, 6.0.1 >Reporter: Eric Henry >Priority: Major > > *Description:* > Unable to import flight module in pyarrow since version 6.0.0. This has been > preventing me from upgrading my version of pyarrow. Will this no longer be > included in the wheel? If so, is there a Jira ticket or announcement so I can > get more information? Is there a different package I can install that > provides the module? Or is this potentially an issue with me using apple > silicon? > > *Error:* > >>>import pyarrow.flight > Traceback (most recent call last): > File "", line 1, in > File > "/Users/erichenry/PycharmProjects/flying-pandas/venv/lib/python3.8/site-packages/pyarrow/flight.py", > line 18, in > from pyarrow._flight import ( # noqa:F401 > ModuleNotFoundError: No module named 'pyarrow._flight' > > *How to replicate* > pip3.8 install pyarrow==6.0.1 > python3.8 -c 'import pyarrow.flight' > > *Python version:* > Python version: 3.8 > > *Environment* > MacBook Pro (Apple Silicon) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14978) Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1
Eric Henry created ARROW-14978: -- Summary: Import Error for pyarrow.flight in pyarrow 6.0.0 and 6.0.1 Key: ARROW-14978 URL: https://issues.apache.org/jira/browse/ARROW-14978 Project: Apache Arrow Issue Type: Bug Affects Versions: 6.0.1, 6.0.0 Reporter: Eric Henry *Description:* Unable to import flight module in pyarrow since version 6.0.0. This has been preventing me from upgrading my version of pyarrow. Will this no longer be included in the wheel? If so, is there a Jira ticket or announcement so I can get more information? Is there a different package I can install that provides the module? *Error:* >>> import pyarrow.flight Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/erichenry/PycharmProjects/flying-pandas/venv/lib/python3.8/site-packages/pyarrow/flight.py", line 18, in <module> from pyarrow._flight import ( # noqa:F401 ModuleNotFoundError: No module named 'pyarrow._flight' *How to replicate* pip3.8 install pyarrow==6.0.1 python3.8 -c 'import pyarrow.flight' *Python version:* Python version: 3.8 -- This message was sent by Atlassian Jira (v8.20.1#820001)
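When triaging a report like this one, it can help to check whether the compiled extension module is present in the installed package at all. The following is a minimal diagnostic sketch using only the standard library; `has_module` is a hypothetical helper name, and the sketch says nothing about why a given wheel might lack the module.

```python
import importlib.util


def has_module(name):
    """Return True if the import system can locate the module `name`."""
    try:
        # For a dotted name, find_spec imports the parent package first.
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (e.g. `pyarrow` itself) is missing
        # or fails to import.
        return False


for name in ("pyarrow", "pyarrow._flight"):
    print(name, "available" if has_module(name) else "missing")
```

If `pyarrow` is reported as available but `pyarrow._flight` as missing, the installed build was presumably produced without the Flight component.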
[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453136#comment-17453136 ] Rok Mihevc commented on ARROW-13168: Indeed it appears to be [~willjones127]. It also seems to be doing the bundling approach - including the timezone data with the install - [https://github.com/r-lib/tzdb/tree/main/inst/tzdata] We also use date.h and tz.h (https://github.com/HowardHinnant/date#summary) in Arrow for almost all time related things. > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: timestamp > > Note: currently timezone database is not available on windows so timezone > aware operations will fail. > We're using tz.h library which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > windows. Database would slowly go stale. > # download it from IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on windows) are required. > # local user-provided folder - user could provide a location at buildtime. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453130#comment-17453130 ] Will Jones commented on ARROW-13168: It looks like that tzdb is a wrapper around this C++ library: [https://github.com/HowardHinnant/date] I also saw that same library recommended in [this StackOverflow answer by a Microsoft employee about Windows TZ libraries|https://stackoverflow.com/a/47106207/2048858]. > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: timestamp > > Note: currently timezone database is not available on windows so timezone > aware operations will fail. > We're using tz.h library which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > windows. Database would slowly go stale. > # download it from IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on windows) are required. > # local user-provided folder - user could provide a location at buildtime. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14905) [C++] Enable CSV Writer to handle quoting
[ https://issues.apache.org/jira/browse/ARROW-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14905: --- Labels: pull-request-available (was: ) > [C++] Enable CSV Writer to handle quoting > - > > Key: ARROW-14905 > URL: https://issues.apache.org/jira/browse/ARROW-14905 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Johan Peltenburg >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This will allow more control over quoting. In {{readr::write_csv()}} > {{{}quote{}}} instructs on how to handle fields which contain characters that > need to be quoted: > * {{{}needed{}}}: only quote fields which need them > * {{{}all{}}}: quote all fields - I think this might be the implicit default > behaviour for {{write_csv_arrow()}} > * {{{}none{}}}: never quote fields -- This message was sent by Atlassian Jira (v8.20.1#820001)
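For comparison, Python's standard `csv` module exposes three quoting styles that roughly correspond to the {{needed}}, {{all}} and {{none}} options listed above. The sketch below is an analogy only and is not the Arrow or readr implementation; `write_rows` is a hypothetical helper, and the backslash escape character used for the "none" case is an assumption (Python refuses to write an unquoted field containing the delimiter unless an escape character is set).

```python
import csv
import io


def write_rows(rows, quoting, **fmt):
    """Write rows to a string using the given csv quoting constant."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=quoting, lineterminator="\n", **fmt)
    writer.writerows(rows)
    return buf.getvalue()


rows = [["id", "note"], [1, "plain"], [2, "has, comma"]]

# "needed": only fields containing special characters are quoted.
print(write_rows(rows, csv.QUOTE_MINIMAL))
# "all": every field is quoted, including numbers.
print(write_rows(rows, csv.QUOTE_ALL))
# "none": nothing is quoted, so the delimiter has to be escaped instead.
print(write_rows(rows, csv.QUOTE_NONE, escapechar="\\"))
```

The "none" case shows why that option is risky: without quoting, the writer needs some other way to protect delimiters inside a field.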
[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453113#comment-17453113 ] Rok Mihevc commented on ARROW-13168: Nice to see interest in this [~paleolimbot]! I will probably pick it up next. I'm not sure we want to depend on R/Python libraries being present from C++, but I'll look into this option, thanks! > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: timestamp > > Note: currently timezone database is not available on windows so timezone > aware operations will fail. > We're using tz.h library which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > windows. Database would slowly go stale. > # download it from IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on windows) are required. > # local user-provided folder - user could provide a location at buildtime. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13035) [C++] Create a compute function returning indices of non-zero values
[ https://issues.apache.org/jira/browse/ARROW-13035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453109#comment-17453109 ] Alessandro Molina commented on ARROW-13035: --- I assigned it to myself the other day as a reminder to tackle it, but I haven't done any work on it yet as I'm finishing another issue. I hope to be able to include it in 7.0.0, by the way. > [C++] Create a compute function returning indices of non-zero values > > > Key: ARROW-13035 > URL: https://issues.apache.org/jira/browse/ARROW-13035 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Alessandro Molina >Priority: Major > Labels: good-first-issue > > This would be similar to Numpy's {{nonzero}} function: > [https://numpy.org/doc/stable/reference/generated/numpy.nonzero.html] > {code:python} > >>> arr = np.array([4,5,0,6,0,5]) > >>> np.nonzero(arr) > (array([0, 1, 3, 5]),) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
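The semantics requested in the ticket can be sketched in plain Python without NumPy. This is a hypothetical reference version for a one-dimensional input, not the eventual Arrow kernel; the name `nonzero_indices` is made up, and treating nulls (`None`) as "not non-zero" is an assumption the ticket does not settle.

```python
def nonzero_indices(values):
    """Return the positions of elements that are neither zero nor None."""
    # Assumption: a null entry is skipped rather than reported as non-zero.
    return [i for i, v in enumerate(values) if v is not None and v != 0]


print(nonzero_indices([4, 5, 0, 6, 0, 5]))  # [0, 1, 3, 5]
```

For the array used in the ticket this gives the same positions as the NumPy example, just as a flat list rather than a tuple of arrays.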
[jira] [Resolved] (ARROW-13016) [C++] Support Null type in Sum/Mean aggregation
[ https://issues.apache.org/jira/browse/ARROW-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-13016. -- Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 10486 [https://github.com/apache/arrow/pull/10486] > [C++] Support Null type in Sum/Mean aggregation > --- > > Key: ARROW-13016 > URL: https://issues.apache.org/jira/browse/ARROW-13016 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Chenxi Li >Assignee: Chenxi Li >Priority: Minor > Labels: kernel, pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13016) [C++] Support Null type in Sum/Mean aggregation
[ https://issues.apache.org/jira/browse/ARROW-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-13016: - Labels: kernel pull-request-available (was: pull-request-available) > [C++] Support Null type in Sum/Mean aggregation > --- > > Key: ARROW-13016 > URL: https://issues.apache.org/jira/browse/ARROW-13016 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Chenxi Li >Assignee: Chenxi Li >Priority: Minor > Labels: kernel, pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-13811) [Java] Provide a general out-of-place sorter
[ https://issues.apache.org/jira/browse/ARROW-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liya Fan resolved ARROW-13811. -- Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11035 [https://github.com/apache/arrow/pull/11035] > [Java] Provide a general out-of-place sorter > > > Key: ARROW-13811 > URL: https://issues.apache.org/jira/browse/ARROW-13811 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > The sorter should work for any type of vectors, with a time complexity of > O(n*log( n )). > Since it does not make any assumptions about the memory layout of the vector, > its performance can be sub-optimal. So if another sorter is applicable > (e.g.{{FixedWidthInPlaceVectorSorter}}), it should be used in preference. -- This message was sent by Atlassian Jira (v8.20.1#820001)
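The idea of a general out-of-place sorter can be illustrated in Python: it needs only element access and a comparator, sorts a permutation of positions in O(n*log(n)) comparisons, and copies the values into a new container while leaving the source untouched. This is a sketch of the general technique, not the Java implementation from the pull request; `sort_out_of_place` and `compare_nulls_first` are hypothetical names.

```python
from functools import cmp_to_key


def sort_out_of_place(src, compare):
    """Return a new sorted list; `src` itself is not modified."""
    # Sort positions rather than values, so nothing is assumed about how
    # the elements are laid out in memory.
    order = sorted(
        range(len(src)),
        key=cmp_to_key(lambda i, j: compare(src[i], src[j])),
    )
    return [src[i] for i in order]


def compare_nulls_first(a, b):
    """Three-way comparison that orders None before any other value."""
    if a is None or b is None:
        return (a is not None) - (b is not None)
    return (a > b) - (a < b)


data = ["pear", None, "apple", "fig"]
print(sort_out_of_place(data, compare_nulls_first))
print(data)  # the source is unchanged
```

Going through a comparator on every comparison is the cost of generality, which matches the ticket's remark that a layout-aware sorter should be preferred when one applies.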
[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453018#comment-17453018 ] Dewey Dunnington commented on ARROW-13168: -- Just a +1 for a "runtime configuration" option. In R we have the [tzdb package|]. Currently it only provides the text format of the IANA database but we could use that approach if we need something different (maintained separately to keep it up to date). I'm less familiar with Python but I imagine something similar exists there, too. {code:R} list.files(tzdb::tzdb_path("text")) #> [1] "africa" "antarctica" "asia" #> [4] "australasia" "backward" "backzone" #> [7] "calendars" "checklinks.awk" "checktab.awk" #> [10] "CONTRIBUTING" "etcetera" "europe" #> [13] "factory" "iso3166.tab" "leap-seconds.list" #> [16] "leapseconds" "leapseconds.awk" "LICENSE" #> [19] "Makefile" "NEWS" "northamerica" #> [22] "README" "southamerica" "theory.html" #> [25] "version" "windowsZones.xml" "ziguard.awk" #> [28] "zishrink.awk" "zone.tab" "zone1970.tab" #> [31] "zoneinfo2tdf.pl" {code} > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: timestamp > > Note: currently timezone database is not available on windows so timezone > aware operations will fail. > We're using tz.h library which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > windows. Database would slowly go stale. 
> # download it from IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on windows) are required. > # local user-provided folder - user could provide a location at buildtime. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
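On the Python side, the standard-library `zoneinfo` module (the subject of the PEP 615 link above, available from Python 3.9) already offers a runtime configuration of the kind listed as the last option. The sketch below shows that mechanism only as an analogy for what a C++ setting might look like; `use_tzdata_dir` is a hypothetical helper. Note that `zoneinfo` reads compiled zoneinfo files, not the text sources that tzdb currently ships.

```python
import os
import tempfile
import zoneinfo


def use_tzdata_dir(path):
    """Point zoneinfo at a user-supplied tzdata directory at runtime."""
    # reset_tzpath only accepts absolute paths and replaces the whole
    # search path with the given sequence.
    zoneinfo.reset_tzpath(to=[os.path.abspath(path)])
    # Read TZPATH through the module: it is rebound on every reset.
    return zoneinfo.TZPATH


with tempfile.TemporaryDirectory() as tzdir:
    print(use_tzdata_dir(tzdir))

# Restore the default search path afterwards.
zoneinfo.reset_tzpath()
```

The same search path can also be supplied through the `PYTHONTZPATH` environment variable, which is the kind of knob the R bindings would need in order to point Arrow at the database bundled with tzdb.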
[jira] [Comment Edited] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453018#comment-17453018 ] Dewey Dunnington edited comment on ARROW-13168 at 12/3/21, 1:53 PM: Just a +1 for a "runtime configuration" option. In R we have the [tzdb package|https://github.com/r-lib/tzdb]. Currently it only provides the text format of the IANA database but we could use that approach if we need something different (maintained separately to keep it up to date). I'm less familiar with Python but I imagine something similar exists there, too. {code:R} list.files(tzdb::tzdb_path("text")) #> [1] "africa" "antarctica" "asia" #> [4] "australasia" "backward" "backzone" #> [7] "calendars" "checklinks.awk" "checktab.awk" #> [10] "CONTRIBUTING" "etcetera" "europe" #> [13] "factory" "iso3166.tab" "leap-seconds.list" #> [16] "leapseconds" "leapseconds.awk" "LICENSE" #> [19] "Makefile" "NEWS" "northamerica" #> [22] "README" "southamerica" "theory.html" #> [25] "version" "windowsZones.xml" "ziguard.awk" #> [28] "zishrink.awk" "zone.tab" "zone1970.tab" #> [31] "zoneinfo2tdf.pl" {code} was (Author: paleolimbot): Just a +1 for a "runtime configuration" option. In R we have the [tzdb package|]. Currently it only provides the text format of the IANA database but we could use that approach if we need something different (maintained separately to keep it up to date). I'm less familiar with Python but I imagine something similar exists there, too. 
{code:R} list.files(tzdb::tzdb_path("text")) #> [1] "africa" "antarctica" "asia" #> [4] "australasia" "backward" "backzone" #> [7] "calendars" "checklinks.awk" "checktab.awk" #> [10] "CONTRIBUTING" "etcetera" "europe" #> [13] "factory" "iso3166.tab" "leap-seconds.list" #> [16] "leapseconds" "leapseconds.awk" "LICENSE" #> [19] "Makefile" "NEWS" "northamerica" #> [22] "README" "southamerica" "theory.html" #> [25] "version" "windowsZones.xml" "ziguard.awk" #> [28] "zishrink.awk" "zone.tab" "zone1970.tab" #> [31] "zoneinfo2tdf.pl" {code} > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: timestamp > > Note: currently timezone database is not available on windows so timezone > aware operations will fail. > We're using tz.h library which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > windows. Database would slowly go stale. > # download it from IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on windows) are required. > # local user-provided folder - user could provide a location at buildtime. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14843) [R] Implement decimal128() (to replace decimal())
[ https://issues.apache.org/jira/browse/ARROW-14843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14843. Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11790 [https://github.com/apache/arrow/pull/11790] > [R] Implement decimal128() (to replace decimal()) > - > > Key: ARROW-14843 > URL: https://issues.apache.org/jira/browse/ARROW-14843 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14905) [C++] Enable CSV Writer to handle quoting
[ https://issues.apache.org/jira/browse/ARROW-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453003#comment-17453003 ] Johan Peltenburg edited comment on ARROW-14905 at 12/3/21, 1:29 PM: The default behavior is currently {{{}needed{}}}, so I'll stick to that as the default. In the case of {{{}all{}}}, it's necessary to make a decision about whether quotes should be inserted for nulls. What I'm currently brewing up doesn't insert them. But in this case I'd opt for calling this option {{all_valid}}, exposing to the user that only valid (non-null) values are quoted. The second possibility is related to ARROW-14903 where we can set a custom value for nulls. If we choose to insert quotes everywhere, even if the null value is empty, it will produce {{{}""{}}}. In the case of a custom null value as described in the other issue, it would also be enclosed in quotes. The drawback is that it then becomes indistinguishable from strings that contain the null value. was (Author: johanpel): The default behavior is currently {{{}needed{}}}, so I'll stick to that as the default. In the case of {{{}all{}}}, it's necessary to make a decision about whether quotes should be inserted for nulls. What I'm currently brewing up doesn't insert them. But in this case I'd opt for calling this option {{all_valid}}, exposing to the user that only valid (non-null) values are quoted. If we choose to insert them, even if the null value is empty, it will produce {{""}} when the quote style is set to {{{}all{}}}. This is slightly related to ARROW-14903, where in the case of a custom null value it would also be enclosed in quotes. The drawback is that it is then indistinguishable from strings that contain the null value. 
> [C++] Enable CSV Writer to handle quoting > - > > Key: ARROW-14905 > URL: https://issues.apache.org/jira/browse/ARROW-14905 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Johan Peltenburg >Priority: Major > > This will allow more control over quoting. In {{readr::write_csv()}} > {{{}quote{}}} instructs on how to handle fields which contain characters that > need to be quoted: > * {{{}needed{}}}: only quote fields which need them > * {{{}all{}}}: quote all fields - I think this might be the implicit default > behaviour for {{write_csv_arrow()}} > * {{{}none{}}}: never quote fields -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14905) [C++] Enable CSV Writer to handle quoting
[ https://issues.apache.org/jira/browse/ARROW-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453003#comment-17453003 ] Johan Peltenburg commented on ARROW-14905: -- The default behavior is currently {{{}needed{}}}, so I'll stick to that as the default. In the case of {{{}all{}}}, it's necessary to make a decision about whether quotes should be inserted for nulls. What I'm currently brewing up doesn't insert them. But in this case I'd opt for calling this option {{all_valid}}, exposing to the user that only valid (non-null) values are quoted. If we choose to insert them, even if the null value is empty, it will produce {{""}} when the quote style is set to {{{}all{}}}. This is slightly related to ARROW-14903, where in the case of a custom null value it would also be enclosed in quotes. The drawback is that it is then indistinguishable from strings that contain the null value. > [C++] Enable CSV Writer to handle quoting > - > > Key: ARROW-14905 > URL: https://issues.apache.org/jira/browse/ARROW-14905 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Johan Peltenburg >Priority: Major > > This will allow more control over quoting. In {{readr::write_csv()}} > {{{}quote{}}} instructs on how to handle fields which contain characters that > need to be quoted: > * {{{}needed{}}}: only quote fields which need them > * {{{}all{}}}: quote all fields - I think this might be the implicit default > behaviour for {{write_csv_arrow()}} > * {{{}none{}}}: never quote fields -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14977) [Python] Add a "made-up" feature for the guide tutorial
[ https://issues.apache.org/jira/browse/ARROW-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452993#comment-17452993 ] Alenka Frim edited comment on ARROW-14977 at 12/3/21, 1:13 PM: --- What do you think about this feature [~amol-]? was (Author: alenkaf): What do you think about this issue [~amol-]? > [Python] Add a "made-up" feature for the guide tutorial > --- > > Key: ARROW-14977 > URL: https://issues.apache.org/jira/browse/ARROW-14977 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Alenka Frim >Assignee: Alenka Frim >Priority: Minor > Fix For: 7.0.0 > > > To create a guide tutorial for a simple Python feature I will add a made-up > function to `compute.py` in order to demonstrate the process of making a > first contribution. > The function will call `min_max` and increase the interval for 1 in both > directions. I would call the new function `tutorial_example`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14977) [Python] Add a "made-up" feature for the guide tutorial
[ https://issues.apache.org/jira/browse/ARROW-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452993#comment-17452993 ] Alenka Frim commented on ARROW-14977: - What do you think about this issue [~amol-]? > [Python] Add a "made-up" feature for the guide tutorial > --- > > Key: ARROW-14977 > URL: https://issues.apache.org/jira/browse/ARROW-14977 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Alenka Frim >Assignee: Alenka Frim >Priority: Minor > Fix For: 7.0.0 > > > To create a guide tutorial for a simple Python feature I will add a made-up > function to `compute.py` in order to demonstrate the process of making a > first contribution. > The function will call `min_max` and increase the interval for 1 in both > directions. I would call the new function `tutorial_example`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14977) [Python] Add a "made-up" feature for the guide tutorial
Alenka Frim created ARROW-14977: --- Summary: [Python] Add a "made-up" feature for the guide tutorial Key: ARROW-14977 URL: https://issues.apache.org/jira/browse/ARROW-14977 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Alenka Frim Assignee: Alenka Frim Fix For: 7.0.0 To create a guide tutorial for a simple Python feature, I will add a made-up function to `compute.py` in order to demonstrate the process of making a first contribution. The function will call `min_max` and widen the resulting interval by 1 in both directions. I would call the new function `tutorial_example`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14445) [C++] Implement memory management for DataHolder
[ https://issues.apache.org/jira/browse/ARROW-14445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ocsa updated ARROW-14445: --- Description: Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use of memory resource manager. This manager will be used to decide when to spill to disk. (was: Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use of memory stats. One method that could be added is the one to compute the use of memory for ExecBatches, This method + ExecContext:MemoryPool stats will be used by the DataHolder to decide when to spill to disk. ) > [C++] Implement memory management for DataHolder > > > Key: ARROW-14445 > URL: https://issues.apache.org/jira/browse/ARROW-14445 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Assignee: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 7.0.0 > > > Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use > of memory resource manager. This manager will be used to decide when to > spill to disk. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14445) [C++] Implement memory stats for DataHolder
[ https://issues.apache.org/jira/browse/ARROW-14445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ocsa updated ARROW-14445: --- Description: Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use of memory stats. One method that could be added is the one to compute the use of memory for ExecBatches, This method + ExecContext:MemoryPool stats will be used by the DataHolder to decide when to spill to disk. (was: Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use of memory management. One method that could be added is the one to compute the use of memory for ExecBatches, This method + ExecContext:MemoryPool stats will be used by the DataHolder to decide when to spill to disk. ) > [C++] Implement memory stats for DataHolder > --- > > Key: ARROW-14445 > URL: https://issues.apache.org/jira/browse/ARROW-14445 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Assignee: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 7.0.0 > > > Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use > of memory stats. One method that could be added is the one to compute the > use of memory for ExecBatches, This method + ExecContext:MemoryPool stats > will be used by the DataHolder to decide when to spill to disk. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14445) [C++] Implement memory management for DataHolder
[ https://issues.apache.org/jira/browse/ARROW-14445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ocsa updated ARROW-14445: --- Summary: [C++] Implement memory management for DataHolder (was: [C++] Implement memory stats for DataHolder) > [C++] Implement memory management for DataHolder > > > Key: ARROW-14445 > URL: https://issues.apache.org/jira/browse/ARROW-14445 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Assignee: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 7.0.0 > > > Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use > of memory stats. One method that could be added is the one to compute the > use of memory for ExecBatches, This method + ExecContext:MemoryPool stats > will be used by the DataHolder to decide when to spill to disk. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14445) [C++] Implement memory stats for DataHolder
[ https://issues.apache.org/jira/browse/ARROW-14445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ocsa updated ARROW-14445: --- Description: Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use of memory management. One method that could be added is the one to compute the use of memory for ExecBatches, This method + ExecContext:MemoryPool stats will be used by the DataHolder to decide when to spill to disk. (was: Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use of memory stats. One method that could be added is the one to compute the use of memory for ExecBatches, This method + ExecContext:MemoryPool stats will be used by the DataHolder to decide when to spill to disk. ) > [C++] Implement memory stats for DataHolder > --- > > Key: ARROW-14445 > URL: https://issues.apache.org/jira/browse/ARROW-14445 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Assignee: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 7.0.0 > > > Issue https://issues.apache.org/jira/browse/ARROW-14330 will require the use > of memory management. One method that could be added is the one to compute > the use of memory for ExecBatches, This method + ExecContext:MemoryPool > stats will be used by the DataHolder to decide when to spill to disk. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14905) [C++] Enable CSV Writer to handle quoting
[ https://issues.apache.org/jira/browse/ARROW-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Peltenburg reassigned ARROW-14905: Assignee: Johan Peltenburg > [C++] Enable CSV Writer to handle quoting > - > > Key: ARROW-14905 > URL: https://issues.apache.org/jira/browse/ARROW-14905 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Johan Peltenburg >Priority: Major > > This will allow more control over quoting. In {{readr::write_csv()}} > {{quote}} instructs on how to handle fields which contain characters that > need to be quoted: > * {{needed}}: only quote fields which need them > * {{all}}: quote all fields - I think this might be the implicit default > behaviour for {{write_csv_arrow()}} > * {{none}}: never quote fields -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-1569) [C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types
[ https://issues.apache.org/jira/browse/ARROW-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthijs Brobbel reassigned ARROW-1569: --- Assignee: Matthijs Brobbel > [C++] Kernel functions for determining monotonicity (ascending or descending) > for well-ordered types > > > Key: ARROW-1569 > URL: https://issues.apache.org/jira/browse/ARROW-1569 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Matthijs Brobbel >Priority: Major > Labels: Analytics > > These kernels must offer some stateful variant so that monotonicity can be > determined across chunked arrays -- This message was sent by Atlassian Jira (v8.20.1#820001)
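To illustrate why a stateful variant is needed for chunked arrays, here is a small pure-Python sketch. It is not an Arrow kernel: the class name `MonotonicityChecker` and its attributes are hypothetical, chunks are modelled as plain lists, and skipping nulls (`None`) is an assumption. The point is that the last value seen must be carried from one chunk to the next, because two chunks can each be sorted while the boundary between them is not.

```python
class MonotonicityChecker:
    """Track monotonicity over a sequence of chunks (hypothetical helper)."""

    def __init__(self):
        self.last = None         # last non-null value seen in any chunk so far
        self.ascending = True    # no decrease seen so far (non-strict)
        self.descending = True   # no increase seen so far (non-strict)

    def update(self, chunk):
        """Consume one chunk and update the running state."""
        for value in chunk:
            if value is None:
                continue  # assumption: nulls are ignored
            if self.last is not None:
                if value < self.last:
                    self.ascending = False
                if value > self.last:
                    self.descending = False
            self.last = value
        return self
```

Feeding the chunks `[1, 3]` and `[2]` shows the boundary case: each chunk is ascending on its own, but the checker reports `ascending` as False because 2 follows 3 across the chunk boundary.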
[jira] [Resolved] (ARROW-14413) [C++][Gandiva] Implement levenshtein function
[ https://issues.apache.org/jira/browse/ARROW-14413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra resolved ARROW-14413. Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11522 [https://github.com/apache/arrow/pull/11522] > [C++][Gandiva] Implement levenshtein function > - > > Key: ARROW-14413 > URL: https://issues.apache.org/jira/browse/ARROW-14413 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Vinicius Souza Roque >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > Returns the Levenshtein distance between two strings. For example, > levenshtein('kitten', 'sitting') results in 3. -- This message was sent by Atlassian Jira (v8.20.1#820001)
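For reference, the Levenshtein distance can be computed with the classic dynamic-programming recurrence. The sketch below is plain Python and is not the Gandiva implementation from the pull request; it compares Python characters, whereas a C++ version would more likely work on bytes, and it keeps only two rows of the table to limit memory use.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    # previous[j] is the distance between the first i-1 chars of a
    # and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

For the example in the issue, 'kitten' becomes 'sitting' by substituting k with s, substituting e with i, and appending g, which gives the distance of 3.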
[jira] [Resolved] (ARROW-14048) [C++][Gandiva] Cache only object code in memory instead of entire module
[ https://issues.apache.org/jira/browse/ARROW-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra resolved ARROW-14048. Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11193 [https://github.com/apache/arrow/pull/11193] > [C++][Gandiva] Cache only object code in memory instead of entire module > > > Key: ARROW-14048 > URL: https://issues.apache.org/jira/browse/ARROW-14048 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Augusto Alves Silva >Assignee: Projjal Chanda >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 12h 50m > Remaining Estimate: 0h > > Make Gandiva cache the object code instead of the entire LLVM module, > improving memory consumption and LLVM time performance. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-12858) [C++][Gandiva] Add isNull, isTrue, isFalse, isNotTrue, IsNotFalse and NVL functions on Gandiva
[ https://issues.apache.org/jira/browse/ARROW-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra resolved ARROW-12858. Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 10385 [https://github.com/apache/arrow/pull/10385] > [C++][Gandiva] Add isNull, isTrue, isFalse, isNotTrue, IsNotFalse and NVL > functions on Gandiva > -- > > Key: ARROW-12858 > URL: https://issues.apache.org/jira/browse/ARROW-12858 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Rodrigo Jacomozzi de Bem >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14936) [C++][Gandiva] Fix split_part function in gandiva
[ https://issues.apache.org/jira/browse/ARROW-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra resolved ARROW-14936. Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11819 [https://github.com/apache/arrow/pull/11819] > [C++][Gandiva] Fix split_part function in gandiva > - > > Key: ARROW-14936 > URL: https://issues.apache.org/jira/browse/ARROW-14936 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > split_part function sporadically returns error with message "Couldn't > allocate output string" -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14976) [Dev][Archery] Print more friendly error logs for non-existent benchmark
Yibo Cai created ARROW-14976: Summary: [Dev][Archery] Print more friendly error logs for non-existent benchmark Key: ARROW-14976 URL: https://issues.apache.org/jira/browse/ARROW-14976 Project: Apache Arrow Issue Type: Improvement Components: Archery Reporter: Yibo Cai When the requested benchmark does not exist, the archery benchmark tool prints an error message that is not very useful. For example, a typo (*parser* -> *parsing*) in the command _"archery benchmark diff --suite-filter=arrow-csv-parsing-benchmark"_ leads to the confusing error message below: {noformat} Traceback (most recent call last): File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1533, in pandas._libs.hashtable.Float64HashTable.get_item TypeError: must be real number, not str During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/cyb/venv/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc KeyError: 'change' {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
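One way to get a friendlier message is to check that the filter matched something before any comparison is attempted. The sketch below is not Archery code: the function `select_suites`, the exception `NoMatchingSuiteError`, and the assumption that the suite filter is applied as a regular expression over a known list of suite names are all hypothetical and only illustrate the idea.

```python
import re


class NoMatchingSuiteError(Exception):
    """Raised when a suite filter selects no benchmark suite (hypothetical)."""


def select_suites(available, suite_filter):
    """Return the suite names matching suite_filter, or fail with a clear message."""
    pattern = re.compile(suite_filter)
    selected = [name for name in available if pattern.search(name)]
    if not selected:
        # Fail early with the filter and the valid names, instead of
        # letting an empty result surface later as a pandas KeyError
        raise NoMatchingSuiteError(
            f"no benchmark suite matches {suite_filter!r}; "
            f"available suites: {', '.join(sorted(available))}"
        )
    return selected
```

With the typo from the report, the user would see the rejected filter next to the list of valid suite names rather than `KeyError: 'change'`.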
[jira] [Commented] (ARROW-14965) [Python][C++] Contention when reading Parquet files with multi-threading
[ https://issues.apache.org/jira/browse/ARROW-14965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452843#comment-17452843 ] Nick Gates commented on ARROW-14965: Really appreciate you looking into this - I will work to extract a minimal reproduction of this with executable code rather than the pseudocode above. > [Python][C++] Contention when reading Parquet files with multi-threading > > > Key: ARROW-14965 > URL: https://issues.apache.org/jira/browse/ARROW-14965 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 6.0.0 >Reporter: Nick Gates >Priority: Minor > > I'm attempting to read a table from multiple Parquet files where I already > know which row_groups I want to read from each file. I also want to apply a > filter expression while reading. To do this my code looks roughly like this: > > {code:java} > def read_file(filepath): > format = ds.ParquetFileFormat(...) > fragment = format.make_fragment(filepath, row_groups=[0, 1, 2, ...]) > scanner = ds.Scanner.from_fragment( > fragment, > use_threads=True, > use_async=False, > filter=... > ) > return scanner.to_reader().read_all() > with ThreadPoolExecutor() as pool: > pa.concat_tables(pool.map(read_file, file_paths)) {code} > Running with a ProcessPoolExecutor, each of my 13 read_file calls takes at > most 2 seconds. However, with a ThreadPoolExecutor some of the read_file > calls take 20+ seconds. > > I've tried running this with various combinations of use_threads and > use_async to try and see what's happening. The code blocks are sourced from > py-spy, and identifying contention was done with viztracer. > > *use_threads: False, use_async: False* > * It looks like pyarrow._dataset.Scanner.to_reader doesn't release the GIL: > [https://github.com/apache/arrow/blob/be9a22b9b76d9cd83d85d52ffc2844056d90f367/python/pyarrow/_dataset.pyx#L3278-L3283] > * pyarrow._dataset.from_fragment seems to be contended. 
Py-spy suggests this > is around getting the physical_schema from the fragment? > > {code:java} > from_fragment (pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so) > __pyx_getprop_7pyarrow_8_dataset_8Fragment_physical_schema > (pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so) > __pthread_cond_timedwait (libpthread-2.17.so) {code} > > *use_threads: False, use_async: True* > * There's no longer any contention for pyarrow._dataset.from_fragment > * But there's lots of contention for pyarrow.lib.RecordBatchReader.read_all > > {code:java} > arrow::RecordBatchReader::ReadAll (pyarrow/libarrow.so.600) > arrow::dataset::(anonymous namespace)::ScannerRecordBatchReader::ReadNext > (pyarrow/libarrow_dataset.so.600) > arrow::Iterator::Next > > (pyarrow/libarrow_dataset.so.600) > arrow::FutureImpl::Wait (pyarrow/libarrow.so.600) > std::condition_variable::wait (libstdc++.so.6.0.19){code} > *use_threads: True, use_async: False* > * Appears to be some contention on Scanner.to_reader > * But most contention remains for RecordBatchReader.read_all > {code:java} > arrow::RecordBatchReader::ReadAll (pyarrow/libarrow.so.600) > arrow::dataset::(anonymous namespace)::ScannerRecordBatchReader::ReadNext > (pyarrow/libarrow_dataset.so.600) > arrow::Iterator::Next > namespace)::SyncScanner::ScanBatches(arrow::Iterator > >)::{lambda()#1}, arrow::dataset::TaggedRecordBatch> > > (pyarrow/libarrow_dataset.so.600) > std::condition_variable::wait (libstdc++.so.6.0.19) > __pthread_cond_wait (libpthread-2.17.so) {code} > *use_threads: True, use_async: True* > * Contention again mostly for RecordBatchReader.read_all, but seems to > complete in ~12 seconds rather than 20 > {code:java} > arrow::RecordBatchReader::ReadAll (pyarrow/libarrow.so.600) > arrow::dataset::(anonymous namespace)::ScannerRecordBatchReader::ReadNext > (pyarrow/libarrow_dataset.so.600) > arrow::Iterator::Next > > (pyarrow/libarrow_dataset.so.600) > arrow::FutureImpl::Wait (pyarrow/libarrow.so.600) > std::condition_variable::wait 
(libstdc++.so.6.0.19) > __pthread_cond_wait (libpthread-2.17.so) {code} > Is this expected behaviour? Or should it be possible to achieve the same > performance from multi-threading as from multi-processing? > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14975) [C++][Docs] Typo in in emit_dictionary_deltas documentation
[ https://issues.apache.org/jira/browse/ARROW-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zixi Wang updated ARROW-14975: -- External issue URL: https://github.com/apache/arrow/pull/11848 > [C++][Docs] Typo in in emit_dictionary_deltas documentation > --- > > Key: ARROW-14975 > URL: https://issues.apache.org/jira/browse/ARROW-14975 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Zixi Wang >Priority: Trivial > Labels: pull-request-available > Fix For: 6.0.1 > > Time Spent: 10m > Remaining Estimate: 0h > > Typo in \arrow\cpp\src\arrow\ipc\options.h, in my understanding, it should be > emitted, not omitted. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14975) [C++][Docs] Typo in in emit_dictionary_deltas documentation
[ https://issues.apache.org/jira/browse/ARROW-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zixi Wang updated ARROW-14975: -- Labels: pull-request-available (was: ) > [C++][Docs] Typo in in emit_dictionary_deltas documentation > --- > > Key: ARROW-14975 > URL: https://issues.apache.org/jira/browse/ARROW-14975 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Zixi Wang >Priority: Trivial > Labels: pull-request-available > Fix For: 6.0.1 > > > Typo in \arrow\cpp\src\arrow\ipc\options.h, in my understanding, it should be > emitted, not omitted. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14975) [C++][Docs] Typo in in emit_dictionary_deltas documentation
[ https://issues.apache.org/jira/browse/ARROW-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zixi Wang updated ARROW-14975: -- Issue Type: Improvement (was: Bug) > [C++][Docs] Typo in in emit_dictionary_deltas documentation > --- > > Key: ARROW-14975 > URL: https://issues.apache.org/jira/browse/ARROW-14975 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Zixi Wang >Priority: Trivial > Fix For: 6.0.1 > > > Typo in \arrow\cpp\src\arrow\ipc\options.h, in my understanding, it should be > emitted, not omitted. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14975) [C++][Docs] Typo in in emit_dictionary_deltas documentation
Zixi Wang created ARROW-14975: - Summary: [C++][Docs] Typo in in emit_dictionary_deltas documentation Key: ARROW-14975 URL: https://issues.apache.org/jira/browse/ARROW-14975 Project: Apache Arrow Issue Type: Bug Components: C++, Documentation Reporter: Zixi Wang Fix For: 6.0.1 Typo in \arrow\cpp\src\arrow\ipc\options.h, in my understanding, it should be emitted, not omitted. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14737) [C++][Dataset] Support URI-decoding partition keys
[ https://issues.apache.org/jira/browse/ARROW-14737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akhil reassigned ARROW-14737: - Assignee: Akhil > [C++][Dataset] Support URI-decoding partition keys > -- > > Key: ARROW-14737 > URL: https://issues.apache.org/jira/browse/ARROW-14737 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: Akhil >Priority: Major > Labels: dataset, good-first-issue > > From [GitHub issue #11718|https://github.com/apache/arrow/issues/11718], > Delta Lake can URL-encode partition keys. We should add an option as we did > in ARROW-12644 to decode them as well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
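The requested behaviour can be sketched with the Python standard library. This is not the Arrow Dataset implementation: the function `parse_hive_partition` and its `decode` flag are hypothetical, the path is assumed to use Hive-style `key=value` segments, and only the values are decoded here, which is an assumption about what the option would cover.

```python
from urllib.parse import unquote


def parse_hive_partition(path, decode=True):
    """Extract key=value partition segments from a path (hypothetical helper).

    When decode is True, percent-encoded sequences in the values
    (for example %20) are URI-decoded.
    """
    keys = {}
    for segment in path.strip("/").split("/"):
        if "=" not in segment:
            continue  # not a partition segment, e.g. the file name
        key, _, value = segment.partition("=")
        keys[key] = unquote(value) if decode else value
    return keys
```

For a path such as `date=2021-12-03/city=New%20York/part-0.parquet`, the decoded value of `city` is `New York`, while `decode=False` keeps the raw `New%20York`.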