[jira] [Created] (ARROW-13971) [C++][Compute] Improve top_k/bottom_k Selectors via CRTP
Alexander created ARROW-13971: - Summary: [C++][Compute] Improve top_k/bottom_k Selectors via CRTP Key: ARROW-13971 URL: https://issues.apache.org/jira/browse/ARROW-13971 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Alexander As mentioned here: [https://github.com/apache/arrow/pull/11019#discussion_r701349253] Selectors for SelectKUnstable all have a relatively similar core structure. It might be worth considering how some templating (e.g. via [CRTP|https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern], or via a set of helper comparator/iteration templates) could let you factor out the core algorithm and the container-specific bits. It would then also be easier to share generated code between types with the same physical type (e.g. as mentioned, Int64, Timestamp, and Date64 should all use the same generated code underneath). The related task of creating template specialization functions for these types was also mentioned here: https://github.com/apache/arrow/pull/11019#discussion_r700238908 -- This message was sent by Atlassian Jira (v8.3.4#803005)
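The factoring suggested above can be sketched roughly as follows. This is a minimal illustration of the CRTP idea, not Arrow's actual code; the class and method names here are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The base class owns the generic select-k loop; the derived class supplies
// only the ordering. Types with the same physical type (Int64, Timestamp,
// Date64) could then share one instantiation of the base.
template <typename Derived, typename T>
struct SelectKBase {
  // Returns the indices of the k "best" values according to Derived::Better.
  std::vector<size_t> Select(const std::vector<T>& values, size_t k) const {
    const Derived& self = static_cast<const Derived&>(*this);
    std::vector<size_t> indices(values.size());
    for (size_t i = 0; i < indices.size(); ++i) indices[i] = i;
    k = std::min(k, indices.size());
    std::partial_sort(
        indices.begin(), indices.begin() + k, indices.end(),
        [&](size_t a, size_t b) { return self.Better(values[a], values[b]); });
    indices.resize(k);
    return indices;
  }
};

// The container-specific bits reduce to a single comparator each.
struct TopKSelector : SelectKBase<TopKSelector, int64_t> {
  bool Better(int64_t a, int64_t b) const { return a > b; }
};

struct BottomKSelector : SelectKBase<BottomKSelector, int64_t> {
  bool Better(int64_t a, int64_t b) const { return a < b; }
};
```

Because the shared loop lives in one template, adding a new selection order only requires supplying a comparator, with no virtual dispatch in the inner loop.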
[jira] [Created] (ARROW-13970) [C++][Compute] Implement streaming version for SelectK
Alexander created ARROW-13970: - Summary: [C++][Compute] Implement streaming version for SelectK Key: ARROW-13970 URL: https://issues.apache.org/jira/browse/ARROW-13970 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Alexander Fix For: 6.0.0 PR [https://github.com/apache/arrow/pull/11019] implements SelectKUnstable. A heap-based streaming version of it seems to be the right direction, as discussed here: https://github.com/apache/arrow/pull/11019#issuecomment-914419100 -- This message was sent by Atlassian Jira (v8.3.4#803005)
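The heap-based streaming direction can be sketched as follows. This is an illustrative, self-contained example; the class name and batch interface are assumptions, not Arrow's API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Consumes the input batch by batch and keeps only the current top k in a
// size-k min-heap, so the full input never has to be materialized.
class StreamingTopK {
 public:
  explicit StreamingTopK(size_t k) : k_(k) {}

  // O(log k) per value, O(n log k) over the whole stream.
  void Consume(const std::vector<int64_t>& batch) {
    for (int64_t v : batch) {
      if (heap_.size() < k_) {
        heap_.push(v);
      } else if (v > heap_.top()) {
        heap_.pop();  // evict the smallest of the current top k
        heap_.push(v);
      }
    }
  }

  // Drains the heap and returns the top k in descending order.
  std::vector<int64_t> Finish() {
    std::vector<int64_t> out;
    while (!heap_.empty()) {
      out.push_back(heap_.top());
      heap_.pop();
    }
    std::reverse(out.begin(), out.end());
    return out;
  }

 private:
  size_t k_;
  // Min-heap: the smallest retained value sits on top, ready for eviction.
  std::priority_queue<int64_t, std::vector<int64_t>, std::greater<int64_t>>
      heap_;
};
```

Only k values are retained at any moment, so memory stays O(k) regardless of how many batches arrive.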
[jira] [Assigned] (ARROW-13969) [C++][Compute] Implement SelectKStable
[ https://issues.apache.org/jira/browse/ARROW-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander reassigned ARROW-13969: - Assignee: (was: Alexander) > [C++][Compute] Implement SelectKStable > -- > > Key: ARROW-13969 > URL: https://issues.apache.org/jira/browse/ARROW-13969 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander >Priority: Major > Labels: analytics, query-engine > Fix For: 6.0.0 > > > PR [https://github.com/apache/arrow/pull/11019] implements SelectKUnstable. > > Some previous result of SelectKUnstable using StableHeap is shown here: > [https://github.com/apache/arrow/pull/11019#issuecomment-913977337] > > So, implementation of SelectKStable should explore how to implement this > algorithm using StablePartition + stable_sorting. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13969) [C++][Compute] Implement SelectKStable
Alexander created ARROW-13969: - Summary: [C++][Compute] Implement SelectKStable Key: ARROW-13969 URL: https://issues.apache.org/jira/browse/ARROW-13969 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Alexander Assignee: Alexander Fix For: 6.0.0 PR [https://github.com/apache/arrow/pull/11019] implements SelectKUnstable. Some previous results of SelectKUnstable using StableHeap are shown here: [https://github.com/apache/arrow/pull/11019#issuecomment-913977337] The SelectKStable implementation should explore implementing this algorithm using StablePartition + stable_sorting. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-1565) [C++][Compute] Implement TopK/BottomK
[ https://issues.apache.org/jira/browse/ARROW-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander updated ARROW-1565: - Summary: [C++][Compute] Implement TopK/BottomK (was: [C++][Compute] Implement TopK/BottomK streaming execution nodes) > [C++][Compute] Implement TopK/BottomK > - > > Key: ARROW-1565 > URL: https://issues.apache.org/jira/browse/ARROW-1565 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Alexander >Priority: Major > Labels: Analytics, pull-request-available, query-engine > Fix For: 6.0.0 > > Time Spent: 11.5h > Remaining Estimate: 0h > > Heap-based topk can compute these indices in O(n log k) time -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13965) [C++] dynamic_casts in parquet TypedColumnWriterImpl impacting performance
[ https://issues.apache.org/jira/browse/ARROW-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412901#comment-17412901 ] Yibo Cai commented on ARROW-13965: -- This looks like a nice improvement. Will you create a PR? Thanks. > [C++] dynamic_casts in parquet TypedColumnWriterImpl impacting performance > -- > > Key: ARROW-13965 > URL: https://issues.apache.org/jira/browse/ARROW-13965 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Environment: arrow 6.0.0-SNAPSHOT on both RHEL8 (gcc 8.4.1) and MacOS > 11.5.2 (clang 11.0.0) >Reporter: Edward Seidl >Priority: Minor > Attachments: arrow_downcast.patch > > > The methods WriteDictionaryPage(), CheckDictionarySizeLimit(), WriteValues(), > and WriteValuesSpaced() in TypedColumnWriterImpl > (cpp/src/parquet/column_writer.cc) perform dynamic_casts of the current_dict_ > object to either DictEncoder or ValueEncoderType pointers. When calling > WriteBatch() with a large number of values this is ok, but when writing > batches of 1 (as when using the stream api), these dynamic casts can consume > a great deal of cpu. Using gperftools against code I wrote to do a log > structured merge of several parquet files, I measured the dynamic_casts > taking as much as 25% of execution time. > By modifying TypedColumnWriterImpl to save downcasted observer pointers of > the appropriate types, I was able to cut my execution time from 32 to 24 > seconds, validating the gperftools results. I've attached a patch to show > what I did. -- This message was sent by Atlassian Jira (v8.3.4#803005)
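The fix described in the report boils down to hoisting the downcast out of the hot path. A simplified sketch under hypothetical names (stand-ins, not the actual parquet classes):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Stand-ins for the encoder hierarchy; the real classes live in
// cpp/src/parquet/column_writer.cc.
struct Encoder {
  virtual ~Encoder() = default;
};

struct DictEncoder : Encoder {
  int puts = 0;
  void Put(int64_t) { ++puts; }
};

class ColumnWriter {
 public:
  explicit ColumnWriter(std::shared_ptr<Encoder> encoder)
      : encoder_(std::move(encoder)),
        // Downcast once up front; nullptr if this encoder is not
        // dictionary-based.
        dict_encoder_(dynamic_cast<DictEncoder*>(encoder_.get())) {}

  // Hot path: no dynamic_cast per call, just a pointer check. This is what
  // makes batch-of-1 writes (e.g. via the stream API) cheap.
  void WriteValue(int64_t v) {
    if (dict_encoder_ != nullptr) dict_encoder_->Put(v);
  }

 private:
  std::shared_ptr<Encoder> encoder_;
  DictEncoder* dict_encoder_;  // non-owning observer pointer
};
```

The observer pointer stays valid as long as the owning shared_ptr does, so the per-value cost drops from an RTTI walk to a null check.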
[jira] [Resolved] (ARROW-13942) [Dev] cmake_format autotune doesn't work
[ https://issues.apache.org/jira/browse/ARROW-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-13942. -- Fix Version/s: 6.0.0 Resolution: Fixed Issue resolved by pull request 2 [https://github.com/apache/arrow/pull/2] > [Dev] cmake_format autotune doesn't work > > > Key: ARROW-13942 > URL: https://issues.apache.org/jira/browse/ARROW-13942 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > https://github.com/apache/arrow/runs/3550654193?check_suite_focus=true > {noformat} > + python3 -m pip install -r dev/archery/requirements-lint.txt > Defaulting to user installation because normal site-packages is not writeable > ERROR: Could not open requirements file: [Errno 2] No such file or directory: > 'dev/archery/requirements-lint.txt' > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-13968) [R] [CI] Add r-lib actions-based UCRT CI job
[ https://issues.apache.org/jira/browse/ARROW-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson closed ARROW-13968. --- Resolution: Duplicate > [R] [CI] Add r-lib actions-based UCRT CI job > > > Key: ARROW-13968 > URL: https://issues.apache.org/jira/browse/ARROW-13968 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Priority: Major > > https://github.com/r-lib/actions/commit/a89f65b5ed2e9ad25e9518e2796604a1495a2c55 > has been added to r-lib/actions > an example: > https://github.com/jeroen/openssl/commit/fa15ce5ca57f8662d9aa07344b5447ea07457df4 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13967) [Go] Implement Concatenate function for Arrays
[ https://issues.apache.org/jira/browse/ARROW-13967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol updated ARROW-13967: -- Component/s: Go > [Go] Implement Concatenate function for Arrays > -- > > Key: ARROW-13967 > URL: https://issues.apache.org/jira/browse/ARROW-13967 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > This is needed for proper handling of MakeArrayFromScalar when dealing with > nested types, and likely could be useful in other use cases too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13967) [Go] Implement Concatenate function for Arrays
[ https://issues.apache.org/jira/browse/ARROW-13967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13967: --- Labels: pull-request-available (was: ) > [Go] Implement Concatenate function for Arrays > -- > > Key: ARROW-13967 > URL: https://issues.apache.org/jira/browse/ARROW-13967 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This is needed for proper handling of MakeArrayFromScalar when dealing with > nested types, and likely could be useful in other use cases too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13968) [R] [CI] Add r-lib actions-based UCRT CI job
Jonathan Keane created ARROW-13968: -- Summary: [R] [CI] Add r-lib actions-based UCRT CI job Key: ARROW-13968 URL: https://issues.apache.org/jira/browse/ARROW-13968 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jonathan Keane https://github.com/r-lib/actions/commit/a89f65b5ed2e9ad25e9518e2796604a1495a2c55 has been added to r-lib/actions an example: https://github.com/jeroen/openssl/commit/fa15ce5ca57f8662d9aa07344b5447ea07457df4 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13967) [Go] Implement Concatenate function for Arrays
Matthew Topol created ARROW-13967: - Summary: [Go] Implement Concatenate function for Arrays Key: ARROW-13967 URL: https://issues.apache.org/jira/browse/ARROW-13967 Project: Apache Arrow Issue Type: Sub-task Reporter: Matthew Topol Assignee: Matthew Topol This is needed for proper handling of MakeArrayFromScalar when dealing with nested types, and likely could be useful in other use cases too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13964) [Go] Remove Parquet bitmap reader/writer implementations and use the shared arrow bitutils versions
[ https://issues.apache.org/jira/browse/ARROW-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-13964. --- Fix Version/s: 6.0.0 Resolution: Fixed Issue resolved by pull request 11126 [https://github.com/apache/arrow/pull/11126] > [Go] Remove Parquet bitmap reader/writer implementations and use the shared > arrow bitutils versions > --- > > Key: ARROW-13964 > URL: https://issues.apache.org/jira/browse/ARROW-13964 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go, Parquet >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13940) [R] Turn on multithreading with Arrow engine queries
[ https://issues.apache.org/jira/browse/ARROW-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-13940. - Resolution: Fixed Issue resolved by pull request 8 [https://github.com/apache/arrow/pull/8] > [R] Turn on multithreading with Arrow engine queries > > > Key: ARROW-13940 > URL: https://issues.apache.org/jira/browse/ARROW-13940 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Critical > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Since ARROW-13740 was merged, we're seeing dataset queries take close to 2x > longer on conbench > https://conbench.ursa.dev/benchmarks/e54ae362090b4a868bee929d45936400/ > I'm also seeing only one core utilized when running queries locally as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13962) [R] Catch up on the NEWS
[ https://issues.apache.org/jira/browse/ARROW-13962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-13962. - Resolution: Fixed Issue resolved by pull request 11122 [https://github.com/apache/arrow/pull/11122] > [R] Catch up on the NEWS > > > Key: ARROW-13962 > URL: https://issues.apache.org/jira/browse/ARROW-13962 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13877) [C++] Added support for fixed sized list to compute functions that process lists
[ https://issues.apache.org/jira/browse/ARROW-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13877: --- Labels: kernel pull-request-available types (was: kernel types) > [C++] Added support for fixed sized list to compute functions that process > lists > > > Key: ARROW-13877 > URL: https://issues.apache.org/jira/browse/ARROW-13877 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: David Li >Priority: Major > Labels: kernel, pull-request-available, types > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The following functions do not support fixed size list (and should): > - list_flatten > - list_parent_indices (one could argue this doesn't need to be supported > since this should be obvious and fixed_size_list doesn't have an indices > array) > - list_value_length (should be easy) > For reference, the following functions do correctly consume fixed_size_list > (there may be more, this isn't an exhaustive list): > - count > - drop_null > - is_null > - is_valid -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13939) how to do resampling of arrow table using cython
[ https://issues.apache.org/jira/browse/ARROW-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412791#comment-17412791 ] Weston Pace commented on ARROW-13939: - The slice function should be O(1). It does not actually copy memory or create a new array. It simply creates a new view of the same data. The same goes for "table[x:y]", which calls slice under the hood. > how to do resampling of arrow table using cython > > > Key: ARROW-13939 > URL: https://issues.apache.org/jira/browse/ARROW-13939 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: krishna deepak >Priority: Minor > > Can someone please point me to resources on how to write resampling code in > Cython for an Arrow table. > # Will iterating the whole table be slow in Cython? > # Which is the best structure to append new elements to? Is there a way I > can create an empty table of the same schema and keep appending to it, or should I > use vectors/lists and then pass them to create a table? > Performance is very important for me. Any help is highly appreciated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
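A minimal sketch of why slicing is O(1) (illustrative types, not Arrow's actual classes): a slice is a new view that shares the parent's buffer and only stores a different offset and length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// A view is just (shared buffer, offset, length); Slice() touches two
// integers and bumps a reference count, so its cost is independent of the
// data size, and no values are copied.
class ArrayView {
 public:
  explicit ArrayView(std::shared_ptr<std::vector<int64_t>> data)
      : data_(std::move(data)), offset_(0), length_(data_->size()) {}

  // Zero-copy: the returned view shares the same underlying buffer.
  ArrayView Slice(size_t offset, size_t length) const {
    ArrayView view(*this);
    view.offset_ = offset_ + offset;
    view.length_ = length;
    return view;
  }

  int64_t operator[](size_t i) const { return (*data_)[offset_ + i]; }
  size_t length() const { return length_; }

 private:
  std::shared_ptr<std::vector<int64_t>> data_;  // shared, never copied
  size_t offset_;
  size_t length_;
};
```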
[jira] [Assigned] (ARROW-13966) [C++] Comparison kernel(s) for decimals
[ https://issues.apache.org/jira/browse/ARROW-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-13966: Assignee: David Li > [C++] Comparison kernel(s) for decimals > --- > > Key: ARROW-13966 > URL: https://issues.apache.org/jira/browse/ARROW-13966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Jonathan Keane >Assignee: David Li >Priority: Major > Labels: kernel, types > > Even decimal-decimal comparisons return an error: > {code:r} > > Scalar$create(1.5)$cast(decimal(15, 2)) > > > Scalar$create(1.1)$cast(decimal(15, 2)) > Error: NotImplemented: Function greater has no kernel matching input types > (scalar[decimal128(15, 2)], scalar[decimal128(15, 2)]) > {code} > Ideally, we would also be able to (autocast in order to) compare > decimal-float or decimal-integer. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13966) [C++] Comparison kernel(s) for decimals
[ https://issues.apache.org/jira/browse/ARROW-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-13966: - Labels: kernel types (was: ) > [C++] Comparison kernel(s) for decimals > --- > > Key: ARROW-13966 > URL: https://issues.apache.org/jira/browse/ARROW-13966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Jonathan Keane >Priority: Major > Labels: kernel, types > > Even decimal-decimal comparisons return an error: > {code:r} > > Scalar$create(1.5)$cast(decimal(15, 2)) > > > Scalar$create(1.1)$cast(decimal(15, 2)) > Error: NotImplemented: Function greater has no kernel matching input types > (scalar[decimal128(15, 2)], scalar[decimal128(15, 2)]) > {code} > Ideally, we would also be able to (autocast in order to) compare > decimal-float or decimal-integer. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13966) [C++] Comparison kernel(s) for decimals
Jonathan Keane created ARROW-13966: -- Summary: [C++] Comparison kernel(s) for decimals Key: ARROW-13966 URL: https://issues.apache.org/jira/browse/ARROW-13966 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Jonathan Keane Even decimal-decimal comparisons return an error:
{code:r}
> Scalar$create(1.5)$cast(decimal(15, 2)) > Scalar$create(1.1)$cast(decimal(15, 2))
Error: NotImplemented: Function greater has no kernel matching input types (scalar[decimal128(15, 2)], scalar[decimal128(15, 2)])
{code}
Ideally, we would also be able to (autocast in order to) compare decimal-float or decimal-integer. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13961) [C++] iso_calendar may be uninitialized
[ https://issues.apache.org/jira/browse/ARROW-13961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-13961. -- Resolution: Fixed Issue resolved by pull request 11121 [https://github.com/apache/arrow/pull/11121] > [C++] iso_calendar may be uninitialized > --- > > Key: ARROW-13961 > URL: https://issues.apache.org/jira/browse/ARROW-13961 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Neal Richardson >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > {code} > /arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ > may be used uninitialized in this function [-Wmaybe-uninitialized] > 137 | : PrimitiveScalarBase(std::move(type), true), value(value) {} > |^ > In file included from > /tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7: > /arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: > ‘*((void*)& iso_calendar +8)’ was declared here > 697 | std::array iso_calendar; > {code} > fyi [~rokm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13965) [C++] dynamic_casts in parquet TypedColumnWriterImpl impacting performance
Edward Seidl created ARROW-13965: Summary: [C++] dynamic_casts in parquet TypedColumnWriterImpl impacting performance Key: ARROW-13965 URL: https://issues.apache.org/jira/browse/ARROW-13965 Project: Apache Arrow Issue Type: Improvement Components: C++ Environment: arrow 6.0.0-SNAPSHOT on both RHEL8 (gcc 8.4.1) and MacOS 11.5.2 (clang 11.0.0) Reporter: Edward Seidl Attachments: arrow_downcast.patch The methods WriteDictionaryPage(), CheckDictionarySizeLimit(), WriteValues(), and WriteValuesSpaced() in TypedColumnWriterImpl (cpp/src/parquet/column_writer.cc) perform dynamic_casts of the current_dict_ object to either DictEncoder or ValueEncoderType pointers. When calling WriteBatch() with a large number of values this is ok, but when writing batches of 1 (as when using the stream api), these dynamic casts can consume a great deal of cpu. Using gperftools against code I wrote to do a log structured merge of several parquet files, I measured the dynamic_casts taking as much as 25% of execution time. By modifying TypedColumnWriterImpl to save downcasted observer pointers of the appropriate types, I was able to cut my execution time from 32 to 24 seconds, validating the gperftools results. I've attached a patch to show what I did. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13293) [R] open_dataset followed by collect hangs (while compute works)
[ https://issues.apache.org/jira/browse/ARROW-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane reassigned ARROW-13293: Assignee: (was: Nicola Crane) > [R] open_dataset followed by collect hangs (while compute works) > > > Key: ARROW-13293 > URL: https://issues.apache.org/jira/browse/ARROW-13293 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 4.0.1 > Environment: Windows 10 (see also session info included in reprex) >Reporter: Hans Van Calster >Priority: Minor > > Tried to make a reproducible example using the iris dataset, but it works as > expected for that dataset. So the issue might be specific to the dataset I am > using (which contains over 100 columns). The example below illustrates the > issue. > The parquet data used in the example can be downloaded from [this > link|https://drive.google.com/file/d/1MHaq3KqlheqrNm8dk71we74n_ip9hMqJ/view?usp=sharing] > > The issue I see is the following: > > * calling open_dataset() %>% filter() %>% collect() hangs on my machine > (while I would expect that a tibble 1,646 x 116 would be returned very fast) > * The two alternative calls (one using read_parquet on the specific parquet > file within the Dataset on which I filter, and the other using compute() > instead of collect()) seem to work as expected > > ``` r > library(dplyr) > #> > #> Attaching package: 'dplyr' > #> The following objects are masked from 'package:stats': > #> > #> filter, lag > #> The following objects are masked from 'package:base': > #> > #> intersect, setdiff, setequal, union > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > read_parquet("data/lucas_harmonised/1_table/parquet_hive/year=2018/part-4.parquet") > %>% > filter(nuts1 == "BE2") > #> # A tibble: 1,646 x 116 > #> id point_id nuts0 nuts1 nuts2 nuts3 th_lat th_long office_pi ex_ante > #> > #> 1 199451 39803106 BE BE2 BE22 BE221 51.0 5.14 1 0 > #> 
2 220669 39623116 BE BE2 BE21 BE213 51.0 4.88 1 0 > #> 3 215557 39483154 BE BE2 BE21 BE211 51.4 4.64 1 0 > #> 4 223579 40303122 BE BE2 BE22 BE222 51.1 5.84 1 0 > #> 5 331079 39783134 BE BE2 BE21 BE213 51.2 5.09 0 0 > #> 6 225417 39403150 BE BE2 BE21 BE211 51.3 4.53 1 0 > #> 7 3340 38863118 BE BE2 BE23 BE234 51.0 3.79 1 0 > #> 8 137361 38143132 BE BE2 BE25 BE258 51.1 2.75 1 0 > #> 9 221861 38343148 BE BE2 BE25 BE255 51.2 3.02 1 0 > #> 10 787 39523148 BE BE2 BE21 BE211 51.3 4.70 1 0 > #> # ... with 1,636 more rows, and 106 more variables: survey_date , > #> # car_latitude , car_ew , car_longitude , gps_proj , > #> # gps_prec , gps_altitude , gps_lat , gps_ew , > #> # gps_long , obs_dist , obs_direct , obs_type , > #> # obs_radius , letter_group , lc1 , lc1_label , > #> # lc1_spec , lc1_spec_label , lc1_perc , lc2 , > #> # lc2_label , lc2_spec , lc2_spec_label , lc2_perc , > #> # lu1 , lu1_label , lu1_type , lu1_type_label , > #> # lu1_perc , lu2 , lu2_label , lu2_type , > #> # lu2_type_label , lu2_perc , parcel_area_ha , > #> # tree_height_maturity , tree_height_survey , feature_width > , > #> # lm_stone_walls , crop_residues , lm_grass_margins , > #> # grazing , special_status , lc_lu_special_remark , > #> # cprn_cando , cprn_lc , cprn_lc_label , cprn_lc1n , > #> # cprnc_lc1e , cprnc_lc1s , cprnc_lc1w , > #> # cprn_lc1n_brdth , cprn_lc1e_brdth , cprn_lc1s_brdth , > #> # cprn_lc1w_brdth , cprn_lc1n_next , cprn_lc1s_next , > #> # cprn_lc1e_next , cprn_lc1w_next , cprn_urban , > #> # cprn_impervious_perc , inspire_plcc1 , inspire_plcc2 , > #> # inspire_plcc3 , inspire_plcc4 , inspire_plcc5 , > #> # inspire_plcc6 , inspire_plcc7 , inspire_plcc8 , > #> # eunis_complex , grassland_sample , grass_cando , wm , > #> # wm_source , wm_type , wm_delivery , erosion_cando , > #> # soil_stones_perc , bio_sample , soil_bio_taken , > #> # bulk0_10_sample , soil_blk_0_10_taken , bulk10_20_sample , > #> # soil_blk_10_20_taken , bulk20_30_sample , > #> # soil_blk_20_30_taken , 
standard_sample , soil_std_taken , > #> # organic_sample , soil_org_depth_cando , soil_taken , > #> # soil_crop , photo_point , photo_north , photo_south , > #> # photo_east , photo_west , transect , revisit , ... > open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>% > filter(nuts1 == "BE2", year == 2018) %>% > compute() > #> Table > #> 1646 rows x 117 columns > #> $id > #> $point_id > #> $nuts0 > #> $nuts1 > #> $nuts2 > #> $nuts3 > #> $th_lat > #> $th_long > #> $office_pi > #> $ex_ante > #> $survey_date > #> $car_latitude > #> $car_ew > #> $car_longitude > #> $gps_proj > #> $gps_prec > #> $gps_altitude > #> $gps_lat > #> $gps_ew > #> $gps_long > #> $obs_dist > #> $obs_direct > #> $obs_type > #>
[jira] [Commented] (ARROW-13939) how to do resampling of arrow table using cython
[ https://issues.apache.org/jira/browse/ARROW-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412781#comment-17412781 ] krishna deepak commented on ARROW-13939: Can you tell me the time complexity of the C++ Slice function, and of pyarrow table[x:y]? > how to do resampling of arrow table using cython > > > Key: ARROW-13939 > URL: https://issues.apache.org/jira/browse/ARROW-13939 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: krishna deepak >Priority: Minor > > Can someone please point me to resources on how to write resampling code in > Cython for an Arrow table. > # Will iterating the whole table be slow in Cython? > # Which is the best structure to append new elements to? Is there a way I > can create an empty table of the same schema and keep appending to it, or should I > use vectors/lists and then pass them to create a table? > Performance is very important for me. Any help is highly appreciated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13877) [C++] Added support for fixed sized list to compute functions that process lists
[ https://issues.apache.org/jira/browse/ARROW-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-13877: Assignee: David Li > [C++] Added support for fixed sized list to compute functions that process > lists > > > Key: ARROW-13877 > URL: https://issues.apache.org/jira/browse/ARROW-13877 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: David Li >Priority: Major > Labels: kernel, types > Fix For: 6.0.0 > > > The following functions do not support fixed size list (and should): > - list_flatten > - list_parent_indices (one could argue this doesn't need to be supported > since this should be obvious and fixed_size_list doesn't have an indices > array) > - list_value_length (should be easy) > For reference, the following functions do correctly consume fixed_size_list > (there may be more, this isn't an exhaustive list): > - count > - drop_null > - is_null > - is_valid -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13964) [Go] Remove Parquet bitmap reader/writer implementations and use the shared arrow bitutils versions
[ https://issues.apache.org/jira/browse/ARROW-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13964: --- Labels: pull-request-available (was: ) > [Go] Remove Parquet bitmap reader/writer implementations and use the shared > arrow bitutils versions > --- > > Key: ARROW-13964 > URL: https://issues.apache.org/jira/browse/ARROW-13964 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go, Parquet >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13954) [Python] Extend compute kernel type testing to supply scalars
[ https://issues.apache.org/jira/browse/ARROW-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412766#comment-17412766 ] Weston Pace commented on ARROW-13954: - Thanks for pointing this out. You are correct, this is not about testing Python's type support but actually testing the C++ compute kernels via Python. I have attempted to clarify. > [Python] Extend compute kernel type testing to supply scalars > - > > Key: ARROW-13954 > URL: https://issues.apache.org/jira/browse/ARROW-13954 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > > ARROW-13952 introduced testing for the various compute kernel signatures. > The current compute kernel type testing passes in all arguments as arrays. > We should extend it to account for cases where an argument is allowed to be a > scalar. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13955) [Python] Extend compute kernel type testing to cover kernels which take recordbatch / table input
[ https://issues.apache.org/jira/browse/ARROW-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-13955: Description: ARROW-13952 introduced testing for the various compute kernel signatures. Some compute kernels (e.g. drop_nulls) have special handling for RecordBatch & Table. These are not covered by the first pass of type testing. (was: Some compute kernels (e.g. drop_nulls) have special handling for RecordBatch & Table. These are not covered by the first pass of type testing.) > [Python] Extend compute kernel type testing to cover kernels which take > recordbatch / table input > - > > Key: ARROW-13955 > URL: https://issues.apache.org/jira/browse/ARROW-13955 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > > ARROW-13952 introduced testing for the various compute kernel signatures. > Some compute kernels (e.g. drop_nulls) have special handling for RecordBatch > & Table. These are not covered by the first pass of type testing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13954) [Python] Extend compute kernel type testing to supply scalars
[ https://issues.apache.org/jira/browse/ARROW-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-13954: Description: ARROW-13952 introduced testing for the various compute kernel signatures. The current compute kernel type testing passes in all arguments as arrays. We should extend it to account for cases where an argument is allowed to be a scalar. (was: The current compute kernel type testing passes in all arguments as arrays. We should extend it to account for cases where an argument is allowed to be a scalar.) > [Python] Extend compute kernel type testing to supply scalars > - > > Key: ARROW-13954 > URL: https://issues.apache.org/jira/browse/ARROW-13954 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > > ARROW-13952 introduced testing for the various compute kernel signatures. > The current compute kernel type testing passes in all arguments as arrays. > We should extend it to account for cases where an argument is allowed to be a > scalar. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13955) [Python] Extend compute kernel type testing to cover kernels which take recordbatch / table input
[ https://issues.apache.org/jira/browse/ARROW-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-13955: Description: Some compute kernels (e.g. drop_nulls) have special handling for RecordBatch & Table. These are not covered by the first pass of type testing. (was: Some kernels (e.g. drop_nulls) have special handling for RecordBatch & Table. These are not covered by the first pass of type testing.) > [Python] Extend compute kernel type testing to cover kernels which take > recordbatch / table input > - > > Key: ARROW-13955 > URL: https://issues.apache.org/jira/browse/ARROW-13955 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > > Some compute kernels (e.g. drop_nulls) have special handling for RecordBatch > & Table. These are not covered by the first pass of type testing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13954) [Python] Extend compute kernel type testing to supply scalars
[ https://issues.apache.org/jira/browse/ARROW-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-13954: Summary: [Python] Extend compute kernel type testing to supply scalars (was: [Python] Extend type testing to supply scalars) > [Python] Extend compute kernel type testing to supply scalars > - > > Key: ARROW-13954 > URL: https://issues.apache.org/jira/browse/ARROW-13954 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > > The current type testing passes in all arguments as arrays. We should extend > it to account for cases where an argument is allowed to be a scalar. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13955) [Python] Extend compute kernel type testing to cover kernels which take recordbatch / table input
[ https://issues.apache.org/jira/browse/ARROW-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-13955: Summary: [Python] Extend compute kernel type testing to cover kernels which take recordbatch / table input (was: [Python] Extend type testing to cover kernels which take recordbatch / table input) > [Python] Extend compute kernel type testing to cover kernels which take > recordbatch / table input > - > > Key: ARROW-13955 > URL: https://issues.apache.org/jira/browse/ARROW-13955 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > > Some kernels (e.g. drop_nulls) have special handling for RecordBatch & Table. > These are not covered by the first pass of type testing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13963) [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil package
[ https://issues.apache.org/jira/browse/ARROW-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-13963. --- Fix Version/s: 6.0.0 Resolution: Fixed Issue resolved by pull request 11124 [https://github.com/apache/arrow/pull/11124] > [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil > package > > > Key: ARROW-13963 > URL: https://issues.apache.org/jira/browse/ARROW-13963 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Move the implementations of the BitmapReader/Writers from an internal Parquet > module to the arrow bitutil package in order to share them between the Arrow > and Parquet repos. > This covers the Arrow side of adding the implementations; it will be followed > by a change to the parquet module to remove them and point to the merged > arrow utils. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13954) [Python] Extend compute kernel type testing to supply scalars
[ https://issues.apache.org/jira/browse/ARROW-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-13954: Description: The current compute kernel type testing passes in all arguments as arrays. We should extend it to account for cases where an argument is allowed to be a scalar. (was: The current type testing passes in all arguments as arrays. We should extend it to account for cases where an argument is allowed to be a scalar.) > [Python] Extend compute kernel type testing to supply scalars > - > > Key: ARROW-13954 > URL: https://issues.apache.org/jira/browse/ARROW-13954 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > > The current compute kernel type testing passes in all arguments as arrays. > We should extend it to account for cases where an argument is allowed to be a > scalar. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13953) [Python] Extend compute kernel type testing to also test for union types
[ https://issues.apache.org/jira/browse/ARROW-13953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-13953: Description: ARROW-13952 introduced testing for the various compute kernel signatures. Many kernels likely do not support union types (e.g. arithmetic or string kernels) but there are a number of kernels that should operate on any possible input type (e.g. drop_null, filter, take) and we should verify these kernels work correctly with union types. > [Python] Extend compute kernel type testing to also test for union types > > > Key: ARROW-13953 > URL: https://issues.apache.org/jira/browse/ARROW-13953 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > > ARROW-13952 introduced testing for the various compute kernel signatures. > Many kernels likely do not support union types (e.g. arithmetic or string > kernels) but there are a number of kernels that should operate on any > possible input type (e.g. drop_null, filter, take) and we should verify these > kernels work correctly with union types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13953) [Python] Extend compute kernel type testing to also test for union types
[ https://issues.apache.org/jira/browse/ARROW-13953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-13953: Summary: [Python] Extend compute kernel type testing to also test for union types (was: [Python] Extend type testing to also test for union types) > [Python] Extend compute kernel type testing to also test for union types > > > Key: ARROW-13953 > URL: https://issues.apache.org/jira/browse/ARROW-13953 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13952) [Python] Add initial type testing for compute kernels
[ https://issues.apache.org/jira/browse/ARROW-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-13952: Summary: [Python] Add initial type testing for compute kernels (was: [Python] Add initial type testing for kernels) > [Python] Add initial type testing for compute kernels > - > > Key: ARROW-13952 > URL: https://issues.apache.org/jira/browse/ARROW-13952 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > > We need tests that ensure that we are supporting all types that should be > supported for the various compute kernels. I've created a first pass at this > (and filed a number of JIRAs for places we are missing support). This PR is > to get the test itself upstreamed into Arrow. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13952) [Python] Add initial type testing for compute kernels
[ https://issues.apache.org/jira/browse/ARROW-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-13952: Description: We need tests that ensure that we are supporting all types that should be supported for the various compute kernels. For example, arithmetic kernels should support any combination of numeric inputs. I've created a first pass at this (and filed a number of JIRAs for places we are missing support). This PR is to get the test itself upstreamed into Arrow. (was: We need tests that ensure that we are supporting all types that should be supported for the various compute kernels. I've created a first pass at this (and filed a number of JIRAs for places we are missing support). This PR is to get the test itself upstreamed into Arrow.) > [Python] Add initial type testing for compute kernels > - > > Key: ARROW-13952 > URL: https://issues.apache.org/jira/browse/ARROW-13952 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > > We need tests that ensure that we are supporting all types that should be > supported for the various compute kernels. For example, arithmetic kernels > should support any combination of numeric inputs. I've created a first pass > at this (and filed a number of JIRAs for places we are missing support). > This PR is to get the test itself upstreamed into Arrow. -- This message was sent by Atlassian Jira (v8.3.4#803005)
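The "any combination of numeric inputs" requirement described in ARROW-13952 above can be sketched in plain Python. The type names and helper below are illustrative only, not part of the actual Arrow test suite:

```python
import itertools

# Hypothetical sketch of exhaustive type-combination coverage for an
# arithmetic kernel: every (left, right) pairing of numeric types should be
# accepted, so a test can simply iterate over the full Cartesian product.
NUMERIC_TYPES = [
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float32", "float64",
]

def numeric_type_pairs():
    """All input-type pairs an arithmetic kernel like 'add' should support."""
    return list(itertools.product(NUMERIC_TYPES, repeat=2))
```

A type-coverage test would then loop over `numeric_type_pairs()`, invoke the kernel with arrays of each pair of types, and record any combination that raises, which is how the "places we are missing support" were presumably discovered.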
[jira] [Updated] (ARROW-13398) [R] Update install.Rmd vignette
[ https://issues.apache.org/jira/browse/ARROW-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-13398: - Description: Proposed changes: * Break up further to make more skimmable, by using more subheadings * Add flowchart to "how dependencies are resolved" section was: Proposed changes: * Break up further to make more skimmable, by using more subheadings * Add flowchart to "how dependencies are resolved" section * make sure the instructions on how to install from the nightly build set {{repos}} to a vector including the current default so R dependencies will be installed too > [R] Update install.Rmd vignette > --- > > Key: ARROW-13398 > URL: https://issues.apache.org/jira/browse/ARROW-13398 > Project: Apache Arrow > Issue Type: Sub-task > Components: Documentation, R >Reporter: Nicola Crane >Priority: Major > > Proposed changes: > * Break up further to make more skimmable, by using more subheadings > * Add flowchart to "how dependencies are resolved" section -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13842) [C++] Bump vendored date library version
[ https://issues.apache.org/jira/browse/ARROW-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-13842. Resolution: Fixed Issue resolved by pull request 7 [https://github.com/apache/arrow/pull/7] > [C++] Bump vendored date library version > > > Key: ARROW-13842 > URL: https://issues.apache.org/jira/browse/ARROW-13842 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > This fix: [https://github.com/HowardHinnant/date/issues/696] > should let us always re-enable this test: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/pretty_print_test.cc#L454-L466 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13398) [R] Update install.Rmd vignette
[ https://issues.apache.org/jira/browse/ARROW-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-13398: - Description: Proposed changes: * Break up further to make more skimmable, by using more subheadings * Add flowchart to "how dependencies are resolved" section * make sure the instructions on how to install from the nightly build set {{repos}} to a vector so dependencies will be installed was: Proposed changes: * Break up further to make more skimmable, by using more subheadings * Add flowchart to "how dependencies are resolved" section > [R] Update install.Rmd vignette > --- > > Key: ARROW-13398 > URL: https://issues.apache.org/jira/browse/ARROW-13398 > Project: Apache Arrow > Issue Type: Sub-task > Components: Documentation, R >Reporter: Nicola Crane >Priority: Major > > Proposed changes: > * Break up further to make more skimmable, by using more subheadings > * Add flowchart to "how dependencies are resolved" section > * make sure the instructions on how to install from the nightly build set > {{repos}} to a vector so dependencies will be installed -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13398) [R] Update install.Rmd vignette
[ https://issues.apache.org/jira/browse/ARROW-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-13398: - Description: Proposed changes: * Break up further to make more skimmable, by using more subheadings * Add flowchart to "how dependencies are resolved" section * make sure the instructions on how to install from the nightly build set {{repos}} to a vector including the current default so R dependencies will be installed too was: Proposed changes: * Break up further to make more skimmable, by using more subheadings * Add flowchart to "how dependencies are resolved" section * make sure the instructions on how to install from the nightly build set {{repos}} to a vector so dependencies will be installed > [R] Update install.Rmd vignette > --- > > Key: ARROW-13398 > URL: https://issues.apache.org/jira/browse/ARROW-13398 > Project: Apache Arrow > Issue Type: Sub-task > Components: Documentation, R >Reporter: Nicola Crane >Priority: Major > > Proposed changes: > * Break up further to make more skimmable, by using more subheadings > * Add flowchart to "how dependencies are resolved" section > * make sure the instructions on how to install from the nightly build set > {{repos}} to a vector including the current default so R dependencies will be > installed too -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13964) [Go] Remove Parquet bitmap reader/writer implementations and use the shared arrow bitutils versions
[ https://issues.apache.org/jira/browse/ARROW-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol updated ARROW-13964: -- Component/s: Parquet Go > [Go] Remove Parquet bitmap reader/writer implementations and use the shared > arrow bitutils versions > --- > > Key: ARROW-13964 > URL: https://issues.apache.org/jira/browse/ARROW-13964 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go, Parquet >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13963) [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil package
[ https://issues.apache.org/jira/browse/ARROW-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol updated ARROW-13963: -- Component/s: Go > [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil > package > > > Key: ARROW-13963 > URL: https://issues.apache.org/jira/browse/ARROW-13963 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Move the implementations of the BitmapReader/Writers from an internal Parquet > module to the arrow bitutil package in order to share them between the Arrow > and Parquet repos. > This covers the Arrow side of adding the implementations; it will be followed > by a change to the parquet module to remove them and point to the merged > arrow utils. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13963) [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil package
[ https://issues.apache.org/jira/browse/ARROW-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13963: --- Labels: pull-request-available (was: ) > [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil > package > > > Key: ARROW-13963 > URL: https://issues.apache.org/jira/browse/ARROW-13963 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Move the implementations of the BitmapReader/Writers from an internal Parquet > module to the arrow bitutil package in order to share them between the Arrow > and Parquet repos. > This covers the Arrow side of adding the implementations; it will be followed > by a change to the parquet module to remove them and point to the merged > arrow utils. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13964) [Go] Remove Parquet bitmap reader/writer implementations and use the shared arrow bitutils versions
Matthew Topol created ARROW-13964: - Summary: [Go] Remove Parquet bitmap reader/writer implementations and use the shared arrow bitutils versions Key: ARROW-13964 URL: https://issues.apache.org/jira/browse/ARROW-13964 Project: Apache Arrow Issue Type: Sub-task Reporter: Matthew Topol Assignee: Matthew Topol -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13963) [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil package
Matthew Topol created ARROW-13963: - Summary: [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil package Key: ARROW-13963 URL: https://issues.apache.org/jira/browse/ARROW-13963 Project: Apache Arrow Issue Type: Sub-task Reporter: Matthew Topol Assignee: Matthew Topol Move the implementations of the BitmapReader/Writers from an internal Parquet module to the arrow bitutil package in order to share them between the Arrow and Parquet repos. This covers the Arrow side of adding the implementations; it will be followed by a change to the parquet module to remove them and point to the merged arrow utils. -- This message was sent by Atlassian Jira (v8.3.4#803005)
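For readers unfamiliar with the BitmapReader being moved in ARROW-13963, its core job can be illustrated with a short language-agnostic sketch. This is written in Python for brevity (the real implementation is Go), and the simplified helper below is not the actual bitutil API:

```python
# Simplified sketch of what a BitmapReader does: walk an Arrow validity
# bitmap bit by bit, least-significant bit first within each byte, starting
# at an arbitrary bit offset.
def read_bitmap(buf, offset, length):
    """Yield each bitmap bit as a bool, starting at bit position `offset`."""
    for i in range(offset, offset + length):
        yield bool(buf[i // 8] & (1 << (i % 8)))
```

Because both Arrow and Parquet encode validity/definition information as packed little-endian bitmaps, one shared reader/writer pair in the arrow bitutil package can serve both repos, which is the motivation stated in the issue.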
[jira] [Updated] (ARROW-13655) [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14
[ https://issues.apache.org/jira/browse/ARROW-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13655: --- Labels: pull-request-available (was: ) > [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" > error with Thrift 0.14 > -- > > Key: ARROW-13655 > URL: https://issues.apache.org/jira/browse/ARROW-13655 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Parquet >Reporter: Joris Van den Bossche >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 5.0.1, 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > From https://github.com/dask/dask/issues/8027 > Apache Thrift introduced a `MaxMessageSize` configuration option > (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize) > in version 0.14 (THRIFT-5237). > I think this is the cause of an issue reported originally at > https://github.com/dask/dask/issues/8027, where one can get an _"OSError: > Couldn't deserialize thrift: MaxMessageSize reached"_ error while reading a > large Parquet (metadata-only) file. > In the original report, the file was written using the python fastparquet > library (which uses the python thrift bindings, which still use Thrift 0.13), > but I was able to construct a reproducible code example with pyarrow.
> Create a large metadata Parquet file with pyarrow in an environment with > Arrow built against Thrift 0.13 (e.g. with a local install from source, or > by installing pyarrow 2.0 from conda-forge, which can be installed with libthrift 0.13): > {code:python} > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > table = pa.table({str(i): np.random.randn(10) for i in range(1_000)}) > pq.write_table(table, "__temp_file_for_metadata.parquet") > metadata = pq.read_metadata("__temp_file_for_metadata.parquet") > metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet") > [metadata.append_row_groups(metadata2) for _ in range(4000)] > metadata.write_metadata_file("test_parquet_metadata_large_file.parquet") > {code} > And then reading this file again in the same environment works fine, but > reading it in an environment with recent Thrift 0.14 (e.g. installing the latest > pyarrow from conda-forge) gives the following error: > {code:python} > In [1]: import pyarrow.parquet as pq > In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet") > ... > OSError: Couldn't deserialize thrift: MaxMessageSize reached > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13959) [R] Update tests for extracting components from date32 objects
[ https://issues.apache.org/jira/browse/ARROW-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-13959: Fix Version/s: 6.0.0 > [R] Update tests for extracting components from date32 objects > --- > > Key: ARROW-13959 > URL: https://issues.apache.org/jira/browse/ARROW-13959 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The R tests implemented in the PR which adds C++ functionality for extracting > components from date32 objects don't compare Arrow dplyr code with R dplyr > code - these tests should be updated to do so. > https://github.com/apache/arrow/commit/4b5ed4eb5583cf24d8daff05a865c8d1cb616576#diff-1cddae31f0151681f8b551bf834f1b9fb5ebac6061521efc70084b5057f7 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13924) [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and base::endsWith
[ https://issues.apache.org/jira/browse/ARROW-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-13924: Fix Version/s: 6.0.0 Labels: good-first-issue kernel (was: good-first-issue) > [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and > base::endsWith > > > Key: ARROW-13924 > URL: https://issues.apache.org/jira/browse/ARROW-13924 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Minor > Labels: good-first-issue, kernel > Fix For: 6.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13940) [R] Turn on multithreading with Arrow engine queries
[ https://issues.apache.org/jira/browse/ARROW-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-13940: Fix Version/s: 6.0.0 > [R] Turn on multithreading with Arrow engine queries > > > Key: ARROW-13940 > URL: https://issues.apache.org/jira/browse/ARROW-13940 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Since ARROW-13740 was merged, we're seeing dataset queries take close to 2x > longer on conbench > https://conbench.ursa.dev/benchmarks/e54ae362090b4a868bee929d45936400/ > I'm also seeing only one core utilized when running queries locally as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13927) [R] Add Karl to the contributors list for the package
[ https://issues.apache.org/jira/browse/ARROW-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-13927: Fix Version/s: 6.0.0 > [R] Add Karl to the contributors list for the package > - > > Key: ARROW-13927 > URL: https://issues.apache.org/jira/browse/ARROW-13927 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Fix For: 6.0.0 > > > [~karldw] : As recognition of the contributions you have made, especially the > herculean effort with ARROW-12981 we would like to add you to the > contributors list of the R package. > If you are ok with this, would you mind giving us the email you would like to > have listed there (and ORCID, if you have one and want it there as well)? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13655) [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14
[ https://issues.apache.org/jira/browse/ARROW-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13655: --- Fix Version/s: 5.0.1 > [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" > error with Thrift 0.14 > -- > > Key: ARROW-13655 > URL: https://issues.apache.org/jira/browse/ARROW-13655 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Parquet >Reporter: Joris Van den Bossche >Assignee: Antoine Pitrou >Priority: Major > Fix For: 5.0.1, 6.0.0 > > > From https://github.com/dask/dask/issues/8027 > Apache Thrift introduced a `MaxMessageSize` configuration option > (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize) > in version 0.14 (THRIFT-5237). > I think this is the cause of an issue reported originally at > https://github.com/dask/dask/issues/8027, where one can get an _"OSError: > Couldn't deserialize thrift: MaxMessageSize reached"_ error while reading a > large Parquet (metadata-only) file. > In the original report, the file was written using the python fastparquet > library (which uses the python thrift bindings, which still use Thrift 0.13), > but I was able to construct a reproducible code example with pyarrow.
> Create a large metadata Parquet file with pyarrow in an environment with > Arrow built against Thrift 0.13 (e.g. with a local install from source, or > by installing pyarrow 2.0 from conda-forge, which can be installed with libthrift 0.13): > {code:python} > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > table = pa.table({str(i): np.random.randn(10) for i in range(1_000)}) > pq.write_table(table, "__temp_file_for_metadata.parquet") > metadata = pq.read_metadata("__temp_file_for_metadata.parquet") > metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet") > [metadata.append_row_groups(metadata2) for _ in range(4000)] > metadata.write_metadata_file("test_parquet_metadata_large_file.parquet") > {code} > And then reading this file again in the same environment works fine, but > reading it in an environment with recent Thrift 0.14 (e.g. installing the latest > pyarrow from conda-forge) gives the following error: > {code:python} > In [1]: import pyarrow.parquet as pq > In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet") > ... > OSError: Couldn't deserialize thrift: MaxMessageSize reached > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-11885) [R] Turn off some capabilities when LIBARROW_MINIMAL=true
[ https://issues.apache.org/jira/browse/ARROW-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-11885. - Fix Version/s: 6.0.0 Resolution: Fixed Issue resolved by pull request 11109 [https://github.com/apache/arrow/pull/11109] > [R] Turn off some capabilities when LIBARROW_MINIMAL=true > - > > Key: ARROW-11885 > URL: https://issues.apache.org/jira/browse/ARROW-11885 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Currently when {{LIBARROW_MINIMAL}} takes a value other than {{false}}, the > Arrow C++ library is built with mimalloc, S3, and compression algos turned > off. Consider whether to turn off some other capabilities when > {{LIBARROW_MINIMAL}} is explicitly set to {{true}}, including Arrow Dataset > and Parquet. > The code that controls this is in {{r/inst/build_arrow_static.sh}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13962) [R] Catch up on the NEWS
[ https://issues.apache.org/jira/browse/ARROW-13962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13962: --- Labels: pull-request-available (was: ) > [R] Catch up on the NEWS > > > Key: ARROW-13962 > URL: https://issues.apache.org/jira/browse/ARROW-13962 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12669) [C++] Kernel to return Array of elements at index of list in ListArray
[ https://issues.apache.org/jira/browse/ARROW-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Percy Camilo Triveño Aucahuasi reassigned ARROW-12669: -- Assignee: Percy Camilo Triveño Aucahuasi > [C++] Kernel to return Array of elements at index of list in ListArray > -- > > Key: ARROW-12669 > URL: https://issues.apache.org/jira/browse/ARROW-12669 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Ian Cook >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Critical > Labels: kernel, types > Fix For: 6.0.0 > > > It would be useful to have a compute function that takes a > {{ListArray}} and an integer index {{n}} and returns an > {{Array}} containing the {{n}}-th item from each list. > This would be useful in combination with existing functions that return > list-type output, for example the string splitting functions. > Let's please ensure that this also works on fixed size list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12669) [C++] Kernel to return Array of elements at index of list in ListArray
[ https://issues.apache.org/jira/browse/ARROW-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Percy Camilo Triveño Aucahuasi updated ARROW-12669: --- Labels: Kernels kernel types (was: kernel types) > [C++] Kernel to return Array of elements at index of list in ListArray > -- > > Key: ARROW-12669 > URL: https://issues.apache.org/jira/browse/ARROW-12669 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Ian Cook >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Critical > Labels: Kernels, kernel, types > Fix For: 6.0.0 > > > It would be useful to have a compute function that takes a > {{ListArray}} and an integer index {{n}} and returns an > {{Array}} containing the {{n}}-th item from each list. > This would be useful in combination with existing functions that return > list-type output, for example the string splitting functions. > Let's please ensure that this also works on fixed size list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12669) [C++] Kernel to return Array of elements at index of list in ListArray
[ https://issues.apache.org/jira/browse/ARROW-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eduardo Ponce reassigned ARROW-12669: - Assignee: (was: Eduardo Ponce) > [C++] Kernel to return Array of elements at index of list in ListArray > -- > > Key: ARROW-12669 > URL: https://issues.apache.org/jira/browse/ARROW-12669 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Ian Cook >Priority: Critical > Labels: kernel, types > Fix For: 6.0.0 > > > It would be useful to have a compute function that takes a > {{ListArray>}} and an integer index {{n}} and returns an > {{Array}} containing the {{n}} th item from each list. > This would be useful in combination with existing functions that return > list-type output, for example the string splitting functions. > Let's please ensure that this also works on fixed size list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13718) [Doc][Cookbook] Creating Arrays - R
[ https://issues.apache.org/jira/browse/ARROW-13718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13718: --- Labels: pull-request-available (was: ) > [Doc][Cookbook] Creating Arrays - R > --- > > Key: ARROW-13718 > URL: https://issues.apache.org/jira/browse/ARROW-13718 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Alessandro Molina >Assignee: Nicola Crane >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13709) [Doc][Cookbook] Reading JSON Files - R
[ https://issues.apache.org/jira/browse/ARROW-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13709: --- Labels: pull-request-available (was: ) > [Doc][Cookbook] Reading JSON Files - R > -- > > Key: ARROW-13709 > URL: https://issues.apache.org/jira/browse/ARROW-13709 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Alessandro Molina >Assignee: Nicola Crane >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12669) [C++] Kernel to return Array of elements at index of list in ListArray
[ https://issues.apache.org/jira/browse/ARROW-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated ARROW-12669: - Priority: Critical (was: Major) > [C++] Kernel to return Array of elements at index of list in ListArray > -- > > Key: ARROW-12669 > URL: https://issues.apache.org/jira/browse/ARROW-12669 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Ian Cook >Assignee: Eduardo Ponce >Priority: Critical > Labels: kernel, types > Fix For: 6.0.0 > > > It would be useful to have a compute function that takes a > {{ListArray>}} and an integer index {{n}} and returns an > {{Array}} containing the {{n}} th item from each list. > This would be useful in combination with existing functions that return > list-type output, for example the string splitting functions. > Let's please ensure that this also works on fixed size list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13655) [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14
[ https://issues.apache.org/jira/browse/ARROW-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-13655: -- Assignee: Antoine Pitrou > [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" > error with Thrift 0.14 > -- > > Key: ARROW-13655 > URL: https://issues.apache.org/jira/browse/ARROW-13655 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Parquet >Reporter: Joris Van den Bossche >Assignee: Antoine Pitrou >Priority: Major > Fix For: 6.0.0 > > > From https://github.com/dask/dask/issues/8027 > Apache Thrift introduced a `MaxMessageSize` configuration option > (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize) > in version 0.14 (THRIFT-5237). > I think this is the cause of an issue reported originally at > https://github.com/dask/dask/issues/8027, where one can get a _"OSError: > Couldn't deserialize thrift: MaxMessageSize reached"_ error while reading a > large Parquet (metadata-only) file. > In the original report, the file was written using the python fastparquet > library (which uses the python thrift bindings, which still use Thrift 0.13), > but I was able to construct a reproducible code example with pyarrow. 
> Create a large metadata Parquet file with pyarrow in an environment with > Arrow built against Thrift 0.13 (eg with a local install from source, or > installing pyarrow 2.0 from conda-forge, which can be installed with libthrift 0.13): > {code:python} > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > table = pa.table({str(i): np.random.randn(10) for i in range(1_000)}) > pq.write_table(table, "__temp_file_for_metadata.parquet") > metadata = pq.read_metadata("__temp_file_for_metadata.parquet") > metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet") > [metadata.append_row_groups(metadata2) for _ in range(4000)] > metadata.write_metadata_file("test_parquet_metadata_large_file.parquet") > {code} > And then reading this file again in the same environment works fine, but > reading it in an environment with recent Thrift 0.14 (eg installing latest > pyarrow with conda-forge) gives the following error: > {code:python} > In [1]: import pyarrow.parquet as pq > In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet") > ... > OSError: Couldn't deserialize thrift: MaxMessageSize reached > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13961) [C++] iso_calendar may be uninitialized
[ https://issues.apache.org/jira/browse/ARROW-13961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412658#comment-17412658 ] Neal Richardson commented on ARROW-13961: - https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=11314=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=6c939d89-0d1a-51f2-8b30-091a7a82e98c=461 > [C++] iso_calendar may be uninitialized > --- > > Key: ARROW-13961 > URL: https://issues.apache.org/jira/browse/ARROW-13961 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Neal Richardson >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > {code} > /arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ > may be used uninitialized in this function [-Wmaybe-uninitialized] > 137 | : PrimitiveScalarBase(std::move(type), true), value(value) {} > |^ > In file included from > /tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7: > /arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: > ‘*((void*)& iso_calendar +8)’ was declared here > 697 | std::array iso_calendar; > {code} > fyi [~rokm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13655) [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14
[ https://issues.apache.org/jira/browse/ARROW-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13655: --- Fix Version/s: 6.0.0 > [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" > error with Thrift 0.14 > -- > > Key: ARROW-13655 > URL: https://issues.apache.org/jira/browse/ARROW-13655 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 6.0.0 > > > From https://github.com/dask/dask/issues/8027 > Apache Thrift introduced a `MaxMessageSize` configuration option > (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize) > in version 0.14 (THRIFT-5237). > I think this is the cause of an issue reported originally at > https://github.com/dask/dask/issues/8027, where one can get a _"OSError: > Couldn't deserialize thrift: MaxMessageSize reached"_ error while reading a > large Parquet (metadata-only) file. > In the original report, the file was written using the python fastparquet > library (which uses the python thrift bindings, which still use Thrift 0.13), > but I was able to construct a reproducible code example with pyarrow. 
> Create a large metadata Parquet file with pyarrow in an environment with > Arrow built against Thrift 0.13 (eg with a local install from source, or > installing pyarrow 2.0 from conda-forge, which can be installed with libthrift 0.13): > {code:python} > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > table = pa.table({str(i): np.random.randn(10) for i in range(1_000)}) > pq.write_table(table, "__temp_file_for_metadata.parquet") > metadata = pq.read_metadata("__temp_file_for_metadata.parquet") > metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet") > [metadata.append_row_groups(metadata2) for _ in range(4000)] > metadata.write_metadata_file("test_parquet_metadata_large_file.parquet") > {code} > And then reading this file again in the same environment works fine, but > reading it in an environment with recent Thrift 0.14 (eg installing latest > pyarrow with conda-forge) gives the following error: > {code:python} > In [1]: import pyarrow.parquet as pq > In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet") > ... > OSError: Couldn't deserialize thrift: MaxMessageSize reached > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13961) [C++] iso_calendar may be uninitialized
[ https://issues.apache.org/jira/browse/ARROW-13961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13961: --- Labels: pull-request-available (was: ) > [C++] iso_calendar may be uninitialized > --- > > Key: ARROW-13961 > URL: https://issues.apache.org/jira/browse/ARROW-13961 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Neal Richardson >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > {code} > /arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ > may be used uninitialized in this function [-Wmaybe-uninitialized] > 137 | : PrimitiveScalarBase(std::move(type), true), value(value) {} > |^ > In file included from > /tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7: > /arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: > ‘*((void*)& iso_calendar +8)’ was declared here > 697 | std::array iso_calendar; > {code} > fyi [~rokm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-13960) [C++] Fix use of non-const references in temporal kernels
[ https://issues.apache.org/jira/browse/ARROW-13960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li closed ARROW-13960. Resolution: Duplicate > [C++] Fix use of non-const references in temporal kernels > - > > Key: ARROW-13960 > URL: https://issues.apache.org/jira/browse/ARROW-13960 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: kernel > > See final comments in https://github.com/apache/arrow/pull/11075 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13033) [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)
[ https://issues.apache.org/jira/browse/ARROW-13033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-13033. Resolution: Fixed Issue resolved by pull request 10610 [https://github.com/apache/arrow/pull/10610] > [C++] Kernel to localize naive timestamps to a timezone (preserving > clock-time) > --- > > Key: ARROW-13033 > URL: https://issues.apache.org/jira/browse/ARROW-13033 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Rok Mihevc >Priority: Major > Labels: pull-request-available, timestamp, timezone > Fix For: 6.0.0 > > Time Spent: 15.5h > Remaining Estimate: 0h > > Given a tz-naive timestamp, "localize" would interpret that timestamp as > local in a given timezone, and return a tz-aware timestamp keeping the same > "clock time" (the same year/month/day/hour/etc in the printed > representation). Under the hood this converts the timestamp value from that > timezone to UTC, since tz-aware timestamps are stored as UTC. > References: > [tz_localize|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.tz_localize.html] > in pandas, or > [force_tz|https://lubridate.tidyverse.org/reference/force_tz.html] in R's > lubridate package > This will (eventually) also have to deal with ambiguous or non-existing times. -- This message was sent by Atlassian Jira (v8.3.4#803005)
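The clock-time-preserving "localize" behaviour described above can be illustrated with the Python standard library (a sketch of the semantics only, not the Arrow kernel; {{zoneinfo}} requires a tz database to be present):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

naive = datetime(2021, 6, 1, 12, 0)                         # tz-naive "wall clock" time
localized = naive.replace(tzinfo=ZoneInfo("Europe/Paris"))  # same clock time, now tz-aware
as_utc = localized.astimezone(timezone.utc)                 # the stored (UTC) representation

print(localized.isoformat())  # 2021-06-01T12:00:00+02:00
print(as_utc.isoformat())     # 2021-06-01T10:00:00+00:00
```

The printed clock time is unchanged by localization; only the underlying UTC value shifts, which is exactly what the kernel must compute since tz-aware timestamps are stored as UTC.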
[jira] [Updated] (ARROW-13549) [C++] Implement timestamp to date/time cast that extracts value
[ https://issues.apache.org/jira/browse/ARROW-13549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-13549: - Description: Change casting from timestamp to date/time to extract the value, instead of just truncating as we currently do (which rounds, giving incorrect answers, in some cases). This should also be a safe cast by default (unless you want to do something like cast from timestamp[ns] to time32[s] which may overflow). This should behave like Postgres DATE/CAST(... as TIME), or Pandas Timestamp.date/Timestamp.time. was: Add a kernel that can extract just the date or the time from a timestamp. This should behave like Postgres DATE/CAST(... as TIME), or Pandas Timestamp.date/Timestamp.time. Extracting the date appears to be doable with an unsafe cast, but it might be more convenient to have an explicit kernel (and an unsafe cast, at least in the Python bindings, disables all checks, not just the check we care about). > [C++] Implement timestamp to date/time cast that extracts value > --- > > Key: ARROW-13549 > URL: https://issues.apache.org/jira/browse/ARROW-13549 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: kernel, pull-request-available > Fix For: 6.0.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > Change casting from timestamp to date/time to extract the value, instead of > just truncating as we currently do (which rounds, giving incorrect answers, > in some cases). This should also be a safe cast by default (unless you want > to do something like cast from timestamp[ns] to time32[s] which may overflow). > This should behave like Postgres DATE/CAST(... as TIME), or Pandas > Timestamp.date/Timestamp.time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13549) [C++] Implement timestamp to date/time cast that extracts value
[ https://issues.apache.org/jira/browse/ARROW-13549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-13549: - Summary: [C++] Implement timestamp to date/time cast that extracts value (was: [C++] Implement kernel to extract date/time from timestamp) > [C++] Implement timestamp to date/time cast that extracts value > --- > > Key: ARROW-13549 > URL: https://issues.apache.org/jira/browse/ARROW-13549 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: kernel, pull-request-available > Fix For: 6.0.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > Add a kernel that can extract just the date or the time from a timestamp. > This should behave like Postgres DATE/CAST(... as TIME), or Pandas > Timestamp.date/Timestamp.time. > Extracting the date appears to be doable with an unsafe cast, but it might be > more convenient to have an explicit kernel (and an unsafe cast, at least in > the Python bindings, disables all checks, not just the check we care about). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-10213) [C++] Temporal cast from timestamp to date rounds instead of extracting date component
[ https://issues.apache.org/jira/browse/ARROW-10213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li closed ARROW-10213. Fix Version/s: 6.0.0 Assignee: David Li Resolution: Duplicate In ARROW-13549 we're implementing a timestamp->date cast which extracts the components instead of truncating. Closing this as a duplicate. > [C++] Temporal cast from timestamp to date rounds instead of extracting date > component > -- > > Key: ARROW-10213 > URL: https://issues.apache.org/jira/browse/ARROW-10213 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 1.0.1 >Reporter: David Li >Assignee: David Li >Priority: Minor > Fix For: 6.0.0 > > > I'd expect this code to give 1950-01-01 twice (i.e. a timestamp -> date cast > extracts the date component, ignoring the time component): > {code:python} > import datetime > import pyarrow as pa > arr = pa.array([ > datetime.datetime(1950, 1, 1, 0, 0, 0), > datetime.datetime(1950, 1, 1, 12, 0, 0), > ], type=pa.timestamp("ns")) > print(arr) > print(arr.cast(pa.date32(), safe=False)) {code} > However it gives 1950-01-02 in the second case: > {noformat} > [ > 1950-01-01 00:00:00.0, > 1950-01-01 12:00:00.0 > ] > [ > 1950-01-01, > 1950-01-02 > ] > {noformat} > The reason is that the temporal cast simply divides, and C truncates towards > 0 (note: Python truncates towards -Infinity, so it would give the right > answer in this case!), resulting in -7304 days instead of -7305. > Depending on the intended semantics of a temporal cast, either it should be > fixed to extract the date component, or the rounding behavior should be noted > and a separate kernel should be implemented for extracting the date component. -- This message was sent by Atlassian Jira (v8.3.4#803005)
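The off-by-one described in ARROW-10213 can be reproduced with plain integer arithmetic (the timestamp value is derived from the issue's example: 1950-01-01 12:00:00 UTC is -631108800 seconds since the Unix epoch):

```python
SECONDS_PER_DAY = 86400
ts = -631108800  # 1950-01-01 12:00:00 UTC, in seconds since the epoch

# C-style division truncates toward zero, so negative values round up:
c_style = int(ts / SECONDS_PER_DAY)   # -7304 -> 1950-01-02 (wrong)

# Floor division extracts the date component correctly:
floored = ts // SECONDS_PER_DAY       # -7305 -> 1950-01-01 (expected)

print(c_style, floored)  # -7304 -7305
```

For non-negative timestamps the two agree, which is why the bug only surfaces for dates before the epoch.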
[jira] [Assigned] (ARROW-13961) [C++] iso_calendar may be uninitialized
[ https://issues.apache.org/jira/browse/ARROW-13961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-13961: Assignee: David Li > [C++] iso_calendar may be uninitialized > --- > > Key: ARROW-13961 > URL: https://issues.apache.org/jira/browse/ARROW-13961 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Neal Richardson >Assignee: David Li >Priority: Major > Fix For: 6.0.0 > > > {code} > /arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ > may be used uninitialized in this function [-Wmaybe-uninitialized] > 137 | : PrimitiveScalarBase(std::move(type), true), value(value) {} > |^ > In file included from > /tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7: > /arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: > ‘*((void*)& iso_calendar +8)’ was declared here > 697 | std::array iso_calendar; > {code} > fyi [~rokm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13960) [C++] Fix use of non-const references in temporal kernels
[ https://issues.apache.org/jira/browse/ARROW-13960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-13960: Assignee: David Li > [C++] Fix use of non-const references in temporal kernels > - > > Key: ARROW-13960 > URL: https://issues.apache.org/jira/browse/ARROW-13960 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: kernel > > See final comments in https://github.com/apache/arrow/pull/11075 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13958) [Python] Migrate Python ORC bindings to use new Result-based APIs
[ https://issues.apache.org/jira/browse/ARROW-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-13958: - Assignee: Joris Van den Bossche > [Python] Migrate Python ORC bindings to use new Result-based APIs > - > > Key: ARROW-13958 > URL: https://issues.apache.org/jira/browse/ARROW-13958 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Needed follow-up on ARROW-13793 (currently compiling pyarrow gives > deprecation warnings about it) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13958) [Python] Migrate Python ORC bindings to use new Result-based APIs
[ https://issues.apache.org/jira/browse/ARROW-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13958: --- Labels: pull-request-available (was: ) > [Python] Migrate Python ORC bindings to use new Result-based APIs > - > > Key: ARROW-13958 > URL: https://issues.apache.org/jira/browse/ARROW-13958 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Needed follow-up on ARROW-13793 (currently compiling pyarrow gives > deprecation warnings about it) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13959) [R] Update tests for extracting components from date32 objects
[ https://issues.apache.org/jira/browse/ARROW-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13959: --- Labels: pull-request-available (was: ) > [R] Update tests for extracting components from date32 objects > --- > > Key: ARROW-13959 > URL: https://issues.apache.org/jira/browse/ARROW-13959 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The R tests implemented in the PR which adds C++ functionality for extracting > components from date32 objects don't compare Arrow dplyr code with R dplyr > code - these tests should be updated to do so. > https://github.com/apache/arrow/commit/4b5ed4eb5583cf24d8daff05a865c8d1cb616576#diff-1cddae31f0151681f8b551bf834f1b9fb5ebac6061521efc70084b5057f7 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13961) [C++] iso_calendar may be uninitialized
[ https://issues.apache.org/jira/browse/ARROW-13961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412576#comment-17412576 ] David Li commented on ARROW-13961: -- I just merged something here from [~aucahuasi] that may be relevant. (I could take care of this when I go back and fix ARROW-13960 as well since it's all around the same code.) > [C++] iso_calendar may be uninitialized > --- > > Key: ARROW-13961 > URL: https://issues.apache.org/jira/browse/ARROW-13961 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Neal Richardson >Priority: Major > Fix For: 6.0.0 > > > {code} > /arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ > may be used uninitialized in this function [-Wmaybe-uninitialized] > 137 | : PrimitiveScalarBase(std::move(type), true), value(value) {} > |^ > In file included from > /tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7: > /arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: > ‘*((void*)& iso_calendar +8)’ was declared here > 697 | std::array iso_calendar; > {code} > fyi [~rokm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13962) [R] Catch up on the NEWS
Neal Richardson created ARROW-13962: --- Summary: [R] Catch up on the NEWS Key: ARROW-13962 URL: https://issues.apache.org/jira/browse/ARROW-13962 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 6.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13961) [C++] iso_calendar may be uninitialized
Neal Richardson created ARROW-13961: --- Summary: [C++] iso_calendar may be uninitialized Key: ARROW-13961 URL: https://issues.apache.org/jira/browse/ARROW-13961 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Neal Richardson Fix For: 6.0.0 {code} /arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ may be used uninitialized in this function [-Wmaybe-uninitialized] 137 | : PrimitiveScalarBase(std::move(type), true), value(value) {} |^ In file included from /tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7: /arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: ‘*((void*)& iso_calendar +8)’ was declared here 697 | std::array iso_calendar; {code} fyi [~rokm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13959) [R] Update tests for extracting components from date32 objects
[ https://issues.apache.org/jira/browse/ARROW-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane reassigned ARROW-13959: Assignee: Nicola Crane > [R] Update tests for extracting components from date32 objects > --- > > Key: ARROW-13959 > URL: https://issues.apache.org/jira/browse/ARROW-13959 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Major > > The R tests implemented in the PR which adds C++ functionality for extracting > components from date32 objects don't compare Arrow dplyr code with R dplyr > code - these tests should be updated to do so. > https://github.com/apache/arrow/commit/4b5ed4eb5583cf24d8daff05a865c8d1cb616576#diff-1cddae31f0151681f8b551bf834f1b9fb5ebac6061521efc70084b5057f7 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13940) [R] Turn on multithreading with Arrow engine queries
[ https://issues.apache.org/jira/browse/ARROW-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13940: --- Labels: pull-request-available (was: ) > [R] Turn on multithreading with Arrow engine queries > > > Key: ARROW-13940 > URL: https://issues.apache.org/jira/browse/ARROW-13940 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Since ARROW-13740 was merged, we're seeing dataset queries take close to 2x > longer on conbench > https://conbench.ursa.dev/benchmarks/e54ae362090b4a868bee929d45936400/ > I'm also seeing only one core utilized when running queries locally as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13940) [R] Turn on multithreading with Arrow engine queries
[ https://issues.apache.org/jira/browse/ARROW-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-13940: Summary: [R] Turn on multithreading with Arrow engine queries (was: [R] Multi-threading with Arrow engine queries?) > [R] Turn on multithreading with Arrow engine queries > > > Key: ARROW-13940 > URL: https://issues.apache.org/jira/browse/ARROW-13940 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > > Since ARROW-13740 was merged, we're seeing dataset queries take close to 2x > longer on conbench > https://conbench.ursa.dev/benchmarks/e54ae362090b4a868bee929d45936400/ > I'm also seeing only one core utilized when running queries locally as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13960) [C++] Fix use of non-const references in temporal kernels
David Li created ARROW-13960: Summary: [C++] Fix use of non-const references in temporal kernels Key: ARROW-13960 URL: https://issues.apache.org/jira/browse/ARROW-13960 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: David Li See final comments in https://github.com/apache/arrow/pull/11075 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13959) [R] Update tests for extracting components from date32 objects
Nicola Crane created ARROW-13959: Summary: [R] Update tests for extracting components from date32 objects Key: ARROW-13959 URL: https://issues.apache.org/jira/browse/ARROW-13959 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane The R tests implemented in the PR which adds C++ functionality for extracting components from date32 objects don't compare Arrow dplyr code with R dplyr code - these tests should be updated to do so. https://github.com/apache/arrow/commit/4b5ed4eb5583cf24d8daff05a865c8d1cb616576#diff-1cddae31f0151681f8b551bf834f1b9fb5ebac6061521efc70084b5057f7 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13940) [R] Multi-threading with Arrow engine queries?
[ https://issues.apache.org/jira/browse/ARROW-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-13940: --- Assignee: Neal Richardson > [R] Multi-threading with Arrow engine queries? > -- > > Key: ARROW-13940 > URL: https://issues.apache.org/jira/browse/ARROW-13940 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > > Since ARROW-13740 was merged, we're seeing dataset queries take close to 2x > longer on conbench > https://conbench.ursa.dev/benchmarks/e54ae362090b4a868bee929d45936400/ > I'm also seeing only one core utilized when running queries locally as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13842) [C++] Bump vendored date library version
[ https://issues.apache.org/jira/browse/ARROW-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13842: --- Labels: pull-request-available (was: ) > [C++] Bump vendored date library version > > > Key: ARROW-13842 > URL: https://issues.apache.org/jira/browse/ARROW-13842 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This fix: [https://github.com/HowardHinnant/date/issues/696] > should let us always re-enable this test: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/pretty_print_test.cc#L454-L466 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13957) [C++] Make Windows S3FileSystem/Minio tests more reliable
[ https://issues.apache.org/jira/browse/ARROW-13957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412547#comment-17412547 ] Antoine Pitrou commented on ARROW-13957: One possibility would be to generate a different bucket name for each test, to make them more independent of each other. > [C++] Make Windows S3FileSystem/Minio tests more reliable > - > > Key: ARROW-13957 > URL: https://issues.apache.org/jira/browse/ARROW-13957 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Priority: Major > > [Example > log|https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/40696885/job/5t25hl7biwxdipe9] > {noformat} > [ RUN ] TestS3FS.FileSystemFromUri > WARNING: maximum file descriptor limit 0 is too low for production servers. > At least 4096 is recommended. Fix with "ulimit -n 4096" > C:/projects/arrow/cpp/src/arrow/filesystem/s3fs_test.cc(387): error: Failed > 'OutcomeToStatus(client_->CreateBucket(req))' failed with IOError: AWS Error > [code 130]: Your previous request to create the named bucket succeeded and > you already own it. > C:/projects/arrow/cpp/src/arrow/util/io_util.cc:1523: When trying to delete > temporary directory: IOError: Cannot delete directory entry > 'C:/Users/appveyor/AppData/Local/Temp/1/s3fs-test-s6295hb6/.minio.sys/tmp/3cb9aaa7-6716-4c53-a30e-c2348f122148': > . Detail: [Windows error 145] The directory is not empty. > [ FAILED ] TestS3FS.FileSystemFromUri (7172 ms) > [ RUN ] TestS3FS.CustomRetryStrategy > WARNING: maximum file descriptor limit 0 is too low for production servers. > At least 4096 is recommended. Fix with "ulimit -n 4096" > C:/projects/arrow/cpp/src/arrow/util/io_util.cc:1523: When trying to delete > temporary directory: IOError: Cannot delete directory entry > 'C:/Users/appveyor/AppData/Local/Temp/1/s3fs-test-wm32qa0y/.minio.sys': . > Detail: [Windows error 145] The directory is not empty. 
> [ OK ] TestS3FS.CustomRetryStrategy (814 ms) > [--] 23 tests from TestS3FS (51710 ms total) {noformat} > The tests are quite slow, and it seems in part because the bucket is being > recreated/deleted on every test; also because some things seem to be > eventually consistent(?) so we aren't cleaning files up properly. > It would also be nice here if the error from CreateBucket contained the > bucket name. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13842) [C++] Bump vendored date library version
[ https://issues.apache.org/jira/browse/ARROW-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-13842: -- Assignee: Antoine Pitrou > [C++] Bump vendored date library version > > > Key: ARROW-13842 > URL: https://issues.apache.org/jira/browse/ARROW-13842 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Fix For: 6.0.0 > > > This fix: [https://github.com/HowardHinnant/date/issues/696] > should let us always re-enable this test: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/pretty_print_test.cc#L454-L466 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13958) [Python] Migrate Python ORC bindings to use new Result-based APIs
Joris Van den Bossche created ARROW-13958: - Summary: [Python] Migrate Python ORC bindings to use new Result-based APIs Key: ARROW-13958 URL: https://issues.apache.org/jira/browse/ARROW-13958 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 6.0.0 Needed follow-up on ARROW-13793 (currently compiling pyarrow gives deprecation warnings about it) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13138) [C++] Implement kernel to extract datetime components (year, month, day, etc) from date type objects
[ https://issues.apache.org/jira/browse/ARROW-13138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-13138. -- Resolution: Fixed Issue resolved by pull request 11075 [https://github.com/apache/arrow/pull/11075] > [C++] Implement kernel to extract datetime components (year, month, day, etc) > from date type objects > > > Key: ARROW-13138 > URL: https://issues.apache.org/jira/browse/ARROW-13138 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Nicola Crane >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > Labels: Kernels, kernel, pull-request-available > Fix For: 6.0.0 > > Time Spent: 5h 20m > Remaining Estimate: 0h > > ARROW-11759 implemented extraction of datetime components for timestamp > objects; please can we have the equivalent extraction functions implemented > for date objects too? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13957) [C++] Make Windows S3FileSystem/Minio tests more reliable
David Li created ARROW-13957: Summary: [C++] Make Windows S3FileSystem/Minio tests more reliable Key: ARROW-13957 URL: https://issues.apache.org/jira/browse/ARROW-13957 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: David Li [Example log|https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/40696885/job/5t25hl7biwxdipe9] {noformat} [ RUN ] TestS3FS.FileSystemFromUri WARNING: maximum file descriptor limit 0 is too low for production servers. At least 4096 is recommended. Fix with "ulimit -n 4096" C:/projects/arrow/cpp/src/arrow/filesystem/s3fs_test.cc(387): error: Failed 'OutcomeToStatus(client_->CreateBucket(req))' failed with IOError: AWS Error [code 130]: Your previous request to create the named bucket succeeded and you already own it. C:/projects/arrow/cpp/src/arrow/util/io_util.cc:1523: When trying to delete temporary directory: IOError: Cannot delete directory entry 'C:/Users/appveyor/AppData/Local/Temp/1/s3fs-test-s6295hb6/.minio.sys/tmp/3cb9aaa7-6716-4c53-a30e-c2348f122148': . Detail: [Windows error 145] The directory is not empty. [ FAILED ] TestS3FS.FileSystemFromUri (7172 ms) [ RUN ] TestS3FS.CustomRetryStrategy WARNING: maximum file descriptor limit 0 is too low for production servers. At least 4096 is recommended. Fix with "ulimit -n 4096" C:/projects/arrow/cpp/src/arrow/util/io_util.cc:1523: When trying to delete temporary directory: IOError: Cannot delete directory entry 'C:/Users/appveyor/AppData/Local/Temp/1/s3fs-test-wm32qa0y/.minio.sys': . Detail: [Windows error 145] The directory is not empty. [ OK ] TestS3FS.CustomRetryStrategy (814 ms) [--] 23 tests from TestS3FS (51710 ms total) {noformat} The tests are quite slow, and it seems in part because the bucket is being recreated/deleted on every test; also because some things seem to be eventually consistent(?) so we aren't cleaning files up properly. It would also be nice here if the error from CreateBucket contained the bucket name. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13680) [C++] Create an asynchronous nursery to simplify capture logic
[ https://issues.apache.org/jira/browse/ARROW-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-13680. -- Resolution: Fixed Issue resolved by pull request 11084 [https://github.com/apache/arrow/pull/11084] > [C++] Create an asynchronous nursery to simplify capture logic > -- > > Key: ARROW-13680 > URL: https://issues.apache.org/jira/browse/ARROW-13680 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: pull-request-available, query-engine > Fix For: 6.0.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > The asynchronous nursery manages a set of asynchronous tasks and objects. > The nursery will not exit until all of those tasks have finished. This > allows one to safely capture fields for asynchronous callbacks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13882) [C++] Add compute function min_max support for more types
[ https://issues.apache.org/jira/browse/ARROW-13882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412530#comment-17412530 ] David Li commented on ARROW-13882: -- The linked PR already adds string support too (for the scalar aggregate only, not for the hash aggregate which I want to split out). I don't think std::minmax is relevant. > [C++] Add compute function min_max support for more types > - > > Key: ARROW-13882 > URL: https://issues.apache.org/jira/browse/ARROW-13882 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: David Li >Priority: Major > Labels: kernel, pull-request-available, types > Time Spent: 20m > Remaining Estimate: 0h > > The min_max compute function does not support the following types but should: > - time32 > - time64 > - timestamp > - null > - binary > - large_binary > - fixed_size_binary > - string > - large_string -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12755) [C++][Compute] Add quotient and modulo kernels
[ https://issues.apache.org/jira/browse/ARROW-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12755: --- Labels: beginner kernel pull-request-available (was: beginner kernel) > [C++][Compute] Add quotient and modulo kernels > -- > > Key: ARROW-12755 > URL: https://issues.apache.org/jira/browse/ARROW-12755 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Eduardo Ponce >Priority: Major > Labels: beginner, kernel, pull-request-available > Fix For: 6.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Add a pair of binary kernels to compute the: > * quotient (result after division, discarding any fractional part, a.k.a > integer division) > * mod or modulo (remainder after division, a.k.a {{%}} / {{%%}} / modulus). > The returned array should have the same data type as the input arrays or > promote to an appropriate type to avoid loss of precision if the input types > differ. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-13898) [C++][Compute] Add support for string binary transforms
[ https://issues.apache.org/jira/browse/ARROW-13898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eduardo Ponce closed ARROW-13898. - Resolution: Abandoned > [C++][Compute] Add support for string binary transforms > --- > > Key: ARROW-13898 > URL: https://issues.apache.org/jira/browse/ARROW-13898 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Eduardo Ponce >Assignee: Eduardo Ponce >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Add a kernel exec generator for string binary functions (similar to > StringTransformExecBase) that always expects the first parameter to be of > string type and the second parameter to be generic. It should also support > scalar and array inputs for the following common cases: > * Scalar, Scalar > * Array, Scalar - the scalar is broadcast and paired with all values from the array > * Array, Array - process arrays element-wise > The Scalar, Array case is not necessary as it is difficult to generalize, and > there are not many functions with this pattern. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13939) how to do resampling of arrow table using cython
[ https://issues.apache.org/jira/browse/ARROW-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412481#comment-17412481 ] krishna deepak commented on ARROW-13939: Hi Will, Yes, resampling a timeseries table, e.g. from 1-min buckets to a 5-min bucket table, etc. Same as dataframe.resample. Does arrow provide this functionality already? So how should I go about iterating the table? From the documentation, all I could find is the *Slice* function, but it does not feel like a proper iterator. Is there anything better for iterating properly? By "build arrays", do you mean arrow chunkedarrays, arraybuilders or cpp vectors? Thanks > how to do resampling of arrow table using cython > > > Key: ARROW-13939 > URL: https://issues.apache.org/jira/browse/ARROW-13939 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: krishna deepak >Priority: Minor > > Can someone please point me to resources on how to write resampling code in > Cython for an Arrow table? > # Will iterating the whole table be slow in Cython? > # Which is the best structure to append new elements to? Is there a way I can > create an empty table of the same schema and keep appending to it, or should I > use vectors/lists and then pass them to create a table? > Performance is very important for me. Any help is highly appreciated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10415) [R] Support for dplyr::distinct()
[ https://issues.apache.org/jira/browse/ARROW-10415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane reassigned ARROW-10415: Assignee: Nicola Crane > [R] Support for dplyr::distinct() > - > > Key: ARROW-10415 > URL: https://issues.apache.org/jira/browse/ARROW-10415 > Project: Apache Arrow > Issue Type: Wish > Components: R >Affects Versions: 2.0.0 >Reporter: Christian M >Assignee: Nicola Crane >Priority: Minor > Labels: dplyr, query-engine > Fix For: 6.0.0 > > Attachments: image-2020-10-28-15-01-54-198.png > > > It would be nice if dplyr::distinct worked with arrow tables: > > !image-2020-10-28-15-01-54-198.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12084) [C++][Compute] Add remainder and quotient compute::Function
[ https://issues.apache.org/jira/browse/ARROW-12084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12084: --- Labels: kernel pull-request-available (was: kernel) > [C++][Compute] Add remainder and quotient compute::Function > --- > > Key: ARROW-12084 > URL: https://issues.apache.org/jira/browse/ARROW-12084 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Assignee: Eduardo Ponce >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In addition to {{divide}} which returns only the quotient, it'd be useful to > have a function which returns both quotient and remainder (these are > efficient to compute simultaneously), probably as a {{struct<quotient: T, remainder: T>}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13954) [Python] Extend type testing to supply scalars
[ https://issues.apache.org/jira/browse/ARROW-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412420#comment-17412420 ] Antoine Pitrou commented on ARROW-13954: Can you update these JIRAs to be more precise about what "type testing" is? Right now I'm thinking {{test_types.py}}. > [Python] Extend type testing to supply scalars > -- > > Key: ARROW-13954 > URL: https://issues.apache.org/jira/browse/ARROW-13954 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: kernel, query-engine > > The current type testing passes in all arguments as arrays. We should extend > it to account for cases where an argument is allowed to be a scalar. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13956) [C++] Add a RETURN_NOT_OK_ELSE_WITH_STATUS macro to support changing the Status
[ https://issues.apache.org/jira/browse/ARROW-13956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412386#comment-17412386 ] Junwang Zhao commented on ARROW-13956: -- Hi [~apitrou], could you please review this issue? I came up with this when I saw this kind of usage in [0], and thought it might be better if we provided this macro. [0]: https://github.com/apache/arrow/pull/10991/files/5ca6aa26b84d9fd89384032383c36bd48259e60b# > [C++] Add a RETURN_NOT_OK_ELSE_WITH_STATUS macro to support changing the > Status > --- > > Key: ARROW-13956 > URL: https://issues.apache.org/jira/browse/ARROW-13956 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Junwang Zhao >Assignee: Junwang Zhao >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > As Result is encouraged to be used, it's better to supply a macro to > change the internal Status. We could do this by using RETURN_NOT_OK_ELSE, > one example is: > > {code:java} > auto reader = arrow::adapters::orc::ORCFileReader::Open(input, pool); > RETURN_NOT_OK_ELSE(reader, _s.WithMessage("Could not open ORC input source > '", source.path(), "': ", _s.message())); > return reader; > {code} > but it's ugly since it uses the _s of the macro. > > Recommended fix: > Add a macro RETURN_NOT_OK_ELSE_WITH_STATUS: > {code:java} > // Use this when you want to change the status in the else_ expr > #define RETURN_NOT_OK_ELSE_WITH_STATUS(s, _s, else_)\ > do { \ > ::arrow::Status _s = ::arrow::internal::GenericToStatus(s); \ > if (!_s.ok()) { \ > else_;\ > return _s;\ > } \ > } while (false) > {code} > And the following statements are more natural: > > {code:java} > auto reader = arrow::adapters::orc::ORCFileReader::Open(input, pool); > RETURN_NOT_OK_ELSE_WITH_STATUS(reader, status, status.WithMessage("Could not > open ORC input source '", source.path(), "': ", status.message())); > return reader; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13956) [C++] Add a RETURN_NOT_OK_ELSE_WITH_STATUS macro to support changing the Status
[ https://issues.apache.org/jira/browse/ARROW-13956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13956: --- Labels: pull-request-available (was: ) > [C++] Add a RETURN_NOT_OK_ELSE_WITH_STATUS macro to support changing the > Status > --- > > Key: ARROW-13956 > URL: https://issues.apache.org/jira/browse/ARROW-13956 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Junwang Zhao >Assignee: Junwang Zhao >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > As Result is encouraged to be used, it's better to supply a macro to > change the internal Status. We could do this by using RETURN_NOT_OK_ELSE, > one example is: > > {code:java} > auto reader = arrow::adapters::orc::ORCFileReader::Open(input, pool); > RETURN_NOT_OK_ELSE(reader, _s.WithMessage("Could not open ORC input source > '", source.path(), "': ", _s.message())); > return reader; > {code} > but it's ugly since it uses the _s of the macro. > > Recommended fix: > Add a macro RETURN_NOT_OK_ELSE_WITH_STATUS: > {code:java} > // Use this when you want to change the status in the else_ expr > #define RETURN_NOT_OK_ELSE_WITH_STATUS(s, _s, else_)\ > do { \ > ::arrow::Status _s = ::arrow::internal::GenericToStatus(s); \ > if (!_s.ok()) { \ > else_;\ > return _s;\ > } \ > } while (false) > {code} > And the following statements are more natural: > > {code:java} > auto reader = arrow::adapters::orc::ORCFileReader::Open(input, pool); > RETURN_NOT_OK_ELSE_WITH_STATUS(reader, status, status.WithMessage("Could not > open ORC input source '", source.path(), "': ", status.message())); > return reader; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)