[jira] [Comment Edited] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters
[ https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632598#comment-17632598 ] Vibhatha Lakmal Abeykoon edited comment on ARROW-15716 at 11/12/22 3:50 AM: Yeah, that is true, since it is always the equality operator. But for other comparison operators it won't hold, so it is better to fuse the expressions at parse time rather than at to_table, for example when filtering a range of values as follows:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

# tempdir is assumed to be an existing pathlib.Path (e.g. a pytest fixture)
df = pd.DataFrame({'a': [1, 2, 1, 2, 3, 4, 5, 1, 2, 4, 7, 8],
                   'b': [10, 30, 20, 40, 50, 60, 30, 50, 60, 10, 11, 12]})
table = pa.Table.from_pandas(df)
path = tempdir / 'partitioning'
collector = []
ds.write_dataset(
    table,
    base_dir=path,
    format="parquet",
    partitioning=["a"],
    partitioning_flavor="hive",
    file_visitor=lambda x: collector.append(x)
)
paths = [file.path for file in collector]
partitioning = ds.partitioning(flavor="hive")
dataset = ds.dataset(source=path, partitioning=partitioning)
filter_expressions = [dataset.partitioning.parse(path) for path in paths]
f11 = ds.field("a") > pc.scalar(3)
f22 = ds.field("a") < pc.scalar(6)
f3 = f11 & f22
print(f3)
new_table = dataset.to_table(filter=f3)
print(table.to_pandas())
print("-" * 80)
print(new_table.to_pandas())
{code}
> [Dataset][Python] Parse a list of fragment paths to gather filters > -- > > Key: ARROW-15716 > URL: https://issues.apache.org/jira/browse/ARROW-15716 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 7.0.0 >Reporter: Lance Dacey >Assignee: Vibhatha Lakmal Abeykoon >Priority: Minor > > Is it possible for partitioning.parse() to be updated to parse a list of > paths instead of just a single path? > I am passing the .paths from file_visitor to downstream tasks to process data > which was recently saved, but I can run into problems with this if I > overwrite data with delete_matching in order to consolidate small files since > the paths won't exist.
> Here is the output of my current approach to use filters instead of reading > the paths directly: > {code:python} > # Fragments saved during write_dataset > ['dev/dataset/fragments/date_id=20210813/data-0.parquet', > 'dev/dataset/fragments/date_id=20210114/data-2.parquet', > 'dev/dataset/fragments/date_id=20210114/data-1.parquet', > 'dev/dataset/fragments/date_id=20210114/data-0.parquet'] > # Run partitioning.parse() on each fragment > [<pyarrow.dataset.Expression (date_id == 20210813)>, > <pyarrow.dataset.Expression (date_id == 20210114)>, > <pyarrow.dataset.Expression (date_id == 20210114)>, > <pyarrow.dataset.Expression (date_id == 20210114)>] > # Format those expressions into a list of tuples > [('date_id', 'in', [20210114, 20210813])] > # Convert to an expression which is used as a filter in .to_table() > is_in(date_id, {value_set=int64:[ > 20210114, > 20210813 > ], skip_nulls=false}) > {code} > My hope would be to do something like filt_exp = partitioning.parse(paths) > which would return a dataset expression. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16340) [C++][Python] Move all Python related code into PyArrow
[ https://issues.apache.org/jira/browse/ARROW-16340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632551#comment-17632551 ] Kouhei Sutou commented on ARROW-16340: -- Because the pyarrow wheel includes a pre-built Apache Arrow C++ library. If you use both Apache Arrow C++ from vcpkg and the pyarrow wheel from PyPI, you mix multiple Apache Arrow C++ libraries, which causes unexpected behavior such as crashes. > [C++][Python] Move all Python related code into PyArrow > --- > > Key: ARROW-16340 > URL: https://issues.apache.org/jira/browse/ARROW-16340 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Alenka Frim >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 33h 10m > Remaining Estimate: 0h > > Move {{src/arrow/python}} directory into {{pyarrow}} and arrange PyArrow to > build it. > More details can be found on this thread: > https://lists.apache.org/thread/jbxyldhqff4p9z53whhs95y4jcomdgd2 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18314) "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes CPP11::unwind_execption, crashed R
[ https://issues.apache.org/jira/browse/ARROW-18314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Mation updated ARROW-18314: - Description: This is running on a Windows environment, arrow 10.0.0 (see arrow_info() below). I issued two calls:
```
ft <- path_to_dataset1
fa <- path_to_dataset2
#1)
tic()
d2 <- ft %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
toc()
927.11 sec elapsed
#returned a dataset with 44 obs, 38 columns, took an abnormally long time (16 min)
#2)
tic()
d3 <- fa %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
terminate called after throwing an instance of 'cpp11::unwind_exception'
```
Then I got an error that crashpad_handler.exe stopped working. R froze and, after a while, crashed too. !image-2022-11-11-14-59-30-132.png! arrow_info() Arrow package version: 10.0.0 Capabilities: dataset TRUE substrait FALSE parquet TRUE json TRUE s3 TRUE gcs TRUE utf8proc TRUE re2 TRUE snappy TRUE gzip TRUE brotli TRUE zstd TRUE lz4 TRUE lz4_frame TRUE lzo FALSE bz2 TRUE jemalloc FALSE mimalloc TRUE Arrow options(): arrow.use_threads FALSE Memory: Allocator mimalloc Current 0 bytes Max 0 bytes Runtime: SIMD Level avx2 Detected SIMD Level avx2 Build: C++ Library Version 10.0.0 C++ Compiler GNU C++ Compiler Version 10.3.0 Git ID aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0 was: I issued two calls:
```
ft <- path_to_dataset1
fa <- path_to_dataset2
#1)
tic()
d2 <- ft %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
toc()
927.11 sec elapsed
#returned a dataset with 44 obs, 38 columns, took an abnormally long time (16 min)
#2)
tic()
d3 <- fa %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
terminate called after throwing an instance of 'cpp11::unwind_exception'
```
Then I got an error that crashpad_handler.exe stopped working. R froze and, after a while, crashed too. !image-2022-11-11-14-59-30-132.png!
> "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes > CPP11::unwind_execption, crashed R > -- > > Key: ARROW-18314 > URL: https://issues.apache.org/jira/browse/ARROW-18314 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > Attachments: image-2022-11-11-14-55-36-430.png, > image-2022-11-11-14-59-30-132.png > > > This is running on a windows environment, arrow 10.0.0 (see arrow_info() > below) > I issued two calls > ``` > ft <- path_to_dataset1 > fa <- path_to_dataset2 > #1) > tic() > d2 <- ft %>% open_dataset %>% filter( pis %in% mypis ) %>% collect > toc() > 927.11 sec elapsed > #returned a dataset with 44 obs, 38 columns, took abnormal time, 16min > #1) > tic() > d3 <- fa %>% open_dataset %>% filter( pis %in% mypis ) %>% collect > terminate called after throwing an instance of 'cpp11::unwind_exception' > ``` > Then I got an error that craspad_hendler.exe stopped working. And R becomes > frozen, after a while R crashed too. > !image-2022-11-11-14-59-30-132.png! > > arrow_info() > Arrow package version: 10.0.0 > Capabilities: > > dataset TRUE > substrait FALSE > parquet TRUE > json TRUE > s3 TRUE > gcs TRUE > utf8proc TRUE > re2 TRUE > snappy TRUE > gzip TRUE > brotli TRUE > zstd TRUE > lz4 TRUE > lz4_frame TRUE > lzo FALSE > bz2 TRUE > jemalloc FALSE > mimalloc TRUE > Arrow options(): > > arrow.use_threads FALSE > Memory: > > Allocator mimalloc > Current 0 bytes > Max 0 bytes > Runtime: > > SIMD Level avx2 > Detected SIMD Level avx2 > Build: > > C++ Library Version 10.0.0 > C++ Compiler GNU > C++ Compiler Version 10.3.0 > Git ID aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0 > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18314) "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes CPP11::unwind_execption, crashed R
[ https://issues.apache.org/jira/browse/ARROW-18314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Mation updated ARROW-18314: - Description: I issued two calls:
```
ft <- path_to_dataset1
fa <- path_to_dataset2
#1)
tic()
d2 <- ft %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
toc()
927.11 sec elapsed
#returned a dataset with 44 obs, 38 columns, took an abnormally long time (16 min)
#2)
tic()
d3 <- fa %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
terminate called after throwing an instance of 'cpp11::unwind_exception'
```
Then I got an error that crashpad_handler.exe stopped working. R froze and, after a while, crashed too. !image-2022-11-11-14-59-30-132.png! was: I issued two calls:
```
ft <- path_to_dataset1
fa <- path_to_dataset2
tic()
d2 <- ft %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
toc()
927.11 sec elapsed
#returned a dataset with 44 obs, 38 columns, took an abnormally long time (16 min)
ft <- paste0(p2,'/RAIS_operacional/vinc_1976_2001/parquet_temp')
fa <- paste0(p2,'/RAIS_operacional/vinc_1976_2001/parquet')
tic()
d3 <- fa %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
terminate called after throwing an instance of 'cpp11::unwind_exception'
```
Then I got an error that crashpad_handler.exe stopped working. R froze and, after a while, crashed too. !image-2022-11-11-14-59-30-132.png!
> "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes > CPP11::unwind_execption, crashed R > -- > > Key: ARROW-18314 > URL: https://issues.apache.org/jira/browse/ARROW-18314 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > Attachments: image-2022-11-11-14-55-36-430.png, > image-2022-11-11-14-59-30-132.png > > > I issued two calls > ``` > ft <- path_to_dataset1 > fa <- path_to_dataset2 > #1) > tic() > d2 <- ft %>% open_dataset %>% filter( pis %in% mypis ) %>% collect > toc() > 927.11 sec elapsed > #returned a dataset with 44 obs, 38 columns, took abnormal time, 16min > #1) > tic() > d3 <- fa %>% open_dataset %>% filter( pis %in% mypis ) %>% collect > terminate called after throwing an instance of 'cpp11::unwind_exception' > ``` > Then I got an error that craspad_hendler.exe stopped working. And R becomes > frozen, after a while R crashed too. > !image-2022-11-11-14-59-30-132.png! > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18314) "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes CPP11::unwind_execption, crashed R
[ https://issues.apache.org/jira/browse/ARROW-18314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Mation updated ARROW-18314: - Summary: "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes CPP11::unwind_execption, crashed R (was: "open_dataset(f) %>% filder(id %in% myvec) %>% collect" causes CPP11::unwind_execption, crashed R) > "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes > CPP11::unwind_execption, crashed R > -- > > Key: ARROW-18314 > URL: https://issues.apache.org/jira/browse/ARROW-18314 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > Attachments: image-2022-11-11-14-55-36-430.png, > image-2022-11-11-14-59-30-132.png > > > I issued two calls > ``` > ft <- path_to_dataset1 > fa <- path_to_dataset2 > tic() > d2 <- ft %>% open_dataset %>% filter( pis %in% mypis ) %>% collect > toc() > 927.11 sec elapsed > #returned a dataset with 44 obs, 38 columns, took abnormal time, 16min > ft <- paste0(p2,'/RAIS_operacional/vinc_1976_2001/parquet_temp') > fa <- paste0(p2,'/RAIS_operacional/vinc_1976_2001/parquet') > tic() > d3 <- fa %>% open_dataset %>% filter( pis %in% mypis ) %>% collect > terminate called after throwing an instance of 'cpp11::unwind_exception' > ``` > Then I got an error that craspad_hendler.exe stopped working. And R becomes > frozen, after a while R crashed too. > !image-2022-11-11-14-59-30-132.png! > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18314) "open_dataset(f) %>% filder(id %in% myvec) %>% collect" causes CPP11::unwind_execption, crashed R
Lucas Mation created ARROW-18314: Summary: "open_dataset(f) %>% filder(id %in% myvec) %>% collect" causes CPP11::unwind_execption, crashed R Key: ARROW-18314 URL: https://issues.apache.org/jira/browse/ARROW-18314 Project: Apache Arrow Issue Type: Bug Reporter: Lucas Mation Attachments: image-2022-11-11-14-55-36-430.png, image-2022-11-11-14-59-30-132.png I issued two calls:
```
ft <- path_to_dataset1
fa <- path_to_dataset2
tic()
d2 <- ft %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
toc()
927.11 sec elapsed
#returned a dataset with 44 obs, 38 columns, took an abnormally long time (16 min)
ft <- paste0(p2,'/RAIS_operacional/vinc_1976_2001/parquet_temp')
fa <- paste0(p2,'/RAIS_operacional/vinc_1976_2001/parquet')
tic()
d3 <- fa %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
terminate called after throwing an instance of 'cpp11::unwind_exception'
```
Then I got an error that crashpad_handler.exe stopped working. R froze and, after a while, crashed too. !image-2022-11-11-14-59-30-132.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16774) [C++] Create Filter Kernel on RLE data
[ https://issues.apache.org/jira/browse/ARROW-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-16774: - Assignee: (was: Tobias Zagorni) > [C++] Create Filter Kernel on RLE data > -- > > Key: ARROW-16774 > URL: https://issues.apache.org/jira/browse/ARROW-16774 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Tobias Zagorni >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17392) [C++] Disable anonymous namespaces in debug mode
[ https://issues.apache.org/jira/browse/ARROW-17392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-17392: - Assignee: (was: Sasha Krassovsky) > [C++] Disable anonymous namespaces in debug mode > > > Key: ARROW-17392 > URL: https://issues.apache.org/jira/browse/ARROW-17392 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Sasha Krassovsky >Priority: Major > > I've had some pain points when using GDB and the pervasive use of anonymous > namespaces throughout the code. I sent out an email on the mailing list and > no one seemed to have any opinions, so I am opening this task. This task will > gate anonymous namespaces around a `#ifndef NDEBUG` flag (or perhaps make a > RELEASE_MODE_ANONYMOUS_NAMESPACE macro of some sort). > > Mailing list discussion: > https://lists.apache.org/thread/61rjzb18mvft7lpwglyh4kq2gkbog4ts -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16774) [C++] Create Filter Kernel on RLE data
[ https://issues.apache.org/jira/browse/ARROW-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632488#comment-17632488 ] Apache Arrow JIRA Bot commented on ARROW-16774: --- This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [C++] Create Filter Kernel on RLE data > -- > > Key: ARROW-16774 > URL: https://issues.apache.org/jira/browse/ARROW-16774 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Tobias Zagorni >Assignee: Tobias Zagorni >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17392) [C++] Disable anonymous namespaces in debug mode
[ https://issues.apache.org/jira/browse/ARROW-17392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632487#comment-17632487 ] Apache Arrow JIRA Bot commented on ARROW-17392: --- This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [C++] Disable anonymous namespaces in debug mode > > > Key: ARROW-17392 > URL: https://issues.apache.org/jira/browse/ARROW-17392 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Sasha Krassovsky >Assignee: Sasha Krassovsky >Priority: Major > > I've had some pain points when using GDB and the pervasive use of anonymous > namespaces throughout the code. I sent out an email on the mailing list and > no one seemed to have any opinions, so I am opening this task. This task will > gate anonymous namespaces around a `#ifndef NDEBUG` flag (or perhaps make a > RELEASE_MODE_ANONYMOUS_NAMESPACE macro of some sort). > > Mailing list discussion: > https://lists.apache.org/thread/61rjzb18mvft7lpwglyh4kq2gkbog4ts -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16680) [R] Weird R error: Error in fs___FileSystem__GetTargetInfos_FileSelector(self, x) : ignoring SIGPIPE signal
[ https://issues.apache.org/jira/browse/ARROW-16680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632484#comment-17632484 ] Carl Boettiger commented on ARROW-16680: Wow, thanks Dewey! That looks like black magic to me, but I can definitely confirm that it works! I'm still a bit stuck on the right thing to do in cases where we provide user-facing packages that rely on arrow functions to access large external data: as you say, I don't mind doing this in my own scripts, but it seems poor form to impose it invisibly on users, where it may have side effects on their other code. > [R] Weird R error: Error in > fs___FileSystem__GetTargetInfos_FileSelector(self, x) : ignoring SIGPIPE > signal > -- > > Key: ARROW-16680 > URL: https://issues.apache.org/jira/browse/ARROW-16680 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0 >Reporter: Carl Boettiger >Priority: Major > > Okay, apologies, this is a bit of a weird error but it is annoying the heck out > of me. The following block of all-R code, when run with Rscript (or embedded > in any form of Rmd, quarto, or knitr doc) produces the error below (at least > most of the time): > {code:java}
> library(arrow)
> library(dplyr)
> Sys.setenv(AWS_EC2_METADATA_DISABLED = "TRUE")
> Sys.unsetenv("AWS_ACCESS_KEY_ID")
> Sys.unsetenv("AWS_SECRET_ACCESS_KEY")
> Sys.unsetenv("AWS_DEFAULT_REGION")
> Sys.unsetenv("AWS_S3_ENDPOINT")
> s3 <- arrow::s3_bucket(bucket = "scores/parquet",
>                        endpoint_override = "data.ecoforecast.org")
> ds <- arrow::open_dataset(s3, partitioning = c("theme", "year"))
> ds |> dplyr::filter(theme == "phenology") |> dplyr::collect()
> {code} > Gives the error: > {code:java}
> Error in fs___FileSystem__GetTargetInfos_FileSelector(self, x) :
>   ignoring SIGPIPE signal
> Calls: %>% ... -> fs___FileSystem__GetTargetInfos_FileSelector
> {code} > But only when run as a script!
When run interactively in an R console, this > code runs just fine. Even as a script the code seems to run fine, but > it erroneously seems to be raising this SIGPIPE, which I don't understand. > If the script is executed with littler > ([https://dirk.eddelbuettel.com/code/littler.html]) then it runs fine, since > littler handles SIGPIPE but Rscript doesn't. But I have no idea why the above > code throws a pipe error in the first place. Worse, if I choose a different filter > for the above, like "aquatics", it (usually) works without the error. > I have no idea why `fs___FileSystem__GetTargetInfos_FileSelector` results in > this, but I would really appreciate any hints on how to avoid it, as it makes > it very hard to use arrow in workflows right now! > > thanks for all you do! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18307) [C++] Read list/array data from ChunkedArray with multiple chunks
[ https://issues.apache.org/jira/browse/ARROW-18307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arthur Passos updated ARROW-18307: -- Description: I am reading a parquet file with arrow::RecordBatchReader and the arrow::Table returned contains columns with multiple chunks (column->num_chunks() > 1). The column in question, although not limited to it, is of type Array(Int64). I want to convert this arrow column into an internal structure that contains a contiguous chunk of memory for the data and a vector of offsets, very similar to arrow's structure. The code I have so far works in two "phases": 1. Get the nested arrow column data; in this case, get the Int64 data out of Array(Int64). 2. Get the offsets from Array(Int64). To achieve #1, I am looping over the chunks and storing arrow::Array::values into a new arrow::ChunkedArray:
{code:java}
static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}
{code}
This does not work as expected, though. Even though there are multiple chunks, the arrow::Array::values method returns the very same buffer for all of them, which ends up duplicating the data on my side. One pattern I noticed is that if I read only the Array(Int64) column, I get only one chunk. If I read both columns, I get two chunks. It looks like all columns will, inevitably, have the same number of chunks, even though their buffers are not chunked accordingly.
I then looked through more examples and came across the [ColumnarTableToVector example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121]. It looks like this example assumes there is only one chunk and ignores the possibility of there being multiple chunks. It's probably just a detail, and the test wasn't actually intended to cover multiple chunks. I managed to get the expected output doing something like the below:
{code:java}
auto & list_chunk1 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & list_chunk2 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(1)));

auto l1_offset = *list_chunk1.raw_value_offsets();
auto l2_offset = *list_chunk2.raw_value_offsets();

auto l1_end_offset = list_chunk1.value_offset(list_chunk1.data()->length);
auto l2_end_offset = list_chunk2.value_offset(list_chunk2.data()->length);

auto lcv1 = list_chunk1.values()->SliceSafe(l1_offset, l1_end_offset - l1_offset).ValueOrDie();
auto lcv2 = list_chunk2.values()->SliceSafe(l2_offset, l2_end_offset - l2_offset).ValueOrDie();
{code}
This looks too hackish, and I feel like there is a much better way. Hence my question: how do I properly extract the data & offsets out of such a column? A more generic version of this is: how do I extract the data out of ChunkedArrays with multiple chunks?
[jira] [Updated] (ARROW-18278) [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error
[ https://issues.apache.org/jira/browse/ARROW-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18278: --- Labels: pull-request-available (was: ) > [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error > - > > Key: ARROW-18278 > URL: https://issues.apache.org/jira/browse/ARROW-18278 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When building with maven on M1 [as per > docs|https://arrow.apache.org/docs/dev/developers/java/building.html#id3]: > {code:bash} > mvn clean install > mvn generate-resources -Pgenerate-libs-jni-macos-linux -N > {code} > I get the following error: > {code:bash} > [INFO] --- exec-maven-plugin:3.1.0:exec (jni-cmake) @ arrow-java-root --- > -- Building using CMake version: 3.24.2 > -- The C compiler identification is AppleClang 14.0.0.1429 > -- The CXX compiler identification is AppleClang 14.0.0.1429 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: > /Library/Developer/CommandLineTools/usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: > /Library/Developer/CommandLineTools/usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Found Java: > /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/bin/java (found > version "11.0.16") > -- Found JNI: > /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/include found > components: AWT JVM > CMake Error at dataset/CMakeLists.txt:18 (find_package): > By not providing "FindArrowDataset.cmake" in CMAKE_MODULE_PATH this project > has asked CMake to find a package configuration file 
provided by > "ArrowDataset", but CMake did not find one. > Could not find a package configuration file provided by "ArrowDataset" with > any of the following names: > ArrowDatasetConfig.cmake > arrowdataset-config.cmake > Add the installation prefix of "ArrowDataset" to CMAKE_PREFIX_PATH or set > "ArrowDataset_DIR" to a directory containing one of the above files. If > "ArrowDataset" provides a separate development package or SDK, be sure it > has been installed. > -- Configuring incomplete, errors occurred! > See also > "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeOutput.log". > See also > "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeError.log". > [ERROR] Command execution failed. > org.apache.commons.exec.ExecuteException: Process exited with an error: 1 > (Exit value: 1) > at org.apache.commons.exec.DefaultExecutor.executeInternal > (DefaultExecutor.java:404) > at org.apache.commons.exec.DefaultExecutor.execute > (DefaultExecutor.java:166) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:1000) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:947) > at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:471) > at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo > (DefaultBuildPluginManager.java:137) > at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 > (MojoExecutor.java:370) > at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute > (MojoExecutor.java:351) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:215) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:171) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:163) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject > (LifecycleModuleBuilder.java:117) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject > (LifecycleModuleBuilder.java:81) > at > 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build > (SingleThreadedBuilder.java:56) > at org.apache.maven.lifecycle.internal.LifecycleStarter.execute > (LifecycleStarter.java:128) > at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:294) > at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192) > at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105) > at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960) > at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293) > at org.apache.maven.cli.MavenCli.main (MavenCli.java:196) > at jdk.inte
[jira] [Assigned] (ARROW-18278) [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error
[ https://issues.apache.org/jira/browse/ARROW-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc reassigned ARROW-18278: -- Assignee: Rok Mihevc > [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error > - > > Key: ARROW-18278 > URL: https://issues.apache.org/jira/browse/ARROW-18278 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > > When building with maven on M1 [as per > docs|https://arrow.apache.org/docs/dev/developers/java/building.html#id3]: > {code:bash} > mvn clean install > mvn generate-resources -Pgenerate-libs-jni-macos-linux -N > {code} > I get the following error: > {code:bash} > [INFO] --- exec-maven-plugin:3.1.0:exec (jni-cmake) @ arrow-java-root --- > -- Building using CMake version: 3.24.2 > -- The C compiler identification is AppleClang 14.0.0.1429 > -- The CXX compiler identification is AppleClang 14.0.0.1429 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: > /Library/Developer/CommandLineTools/usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: > /Library/Developer/CommandLineTools/usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Found Java: > /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/bin/java (found > version "11.0.16") > -- Found JNI: > /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/include found > components: AWT JVM > CMake Error at dataset/CMakeLists.txt:18 (find_package): > By not providing "FindArrowDataset.cmake" in CMAKE_MODULE_PATH this project > has asked CMake to find a package configuration file provided by > "ArrowDataset", but CMake did not find one. 
> Could not find a package configuration file provided by "ArrowDataset" with > any of the following names: > ArrowDatasetConfig.cmake > arrowdataset-config.cmake > Add the installation prefix of "ArrowDataset" to CMAKE_PREFIX_PATH or set > "ArrowDataset_DIR" to a directory containing one of the above files. If > "ArrowDataset" provides a separate development package or SDK, be sure it > has been installed. > -- Configuring incomplete, errors occurred! > See also > "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeOutput.log". > See also > "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeError.log". > [ERROR] Command execution failed. > org.apache.commons.exec.ExecuteException: Process exited with an error: 1 > (Exit value: 1) > at org.apache.commons.exec.DefaultExecutor.executeInternal > (DefaultExecutor.java:404) > at org.apache.commons.exec.DefaultExecutor.execute > (DefaultExecutor.java:166) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:1000) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:947) > at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:471) > at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo > (DefaultBuildPluginManager.java:137) > at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 > (MojoExecutor.java:370) > at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute > (MojoExecutor.java:351) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:215) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:171) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:163) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject > (LifecycleModuleBuilder.java:117) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject > (LifecycleModuleBuilder.java:81) > at > 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build > (SingleThreadedBuilder.java:56) > at org.apache.maven.lifecycle.internal.LifecycleStarter.execute > (LifecycleStarter.java:128) > at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:294) > at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192) > at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105) > at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960) > at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293) > at org.apache.maven.cli.MavenCli.main (MavenCli.java:196) > at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) > at jdk.internal.reflect.NativeMethodAccessorImpl.in
[jira] [Commented] (ARROW-18278) [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error
[ https://issues.apache.org/jira/browse/ARROW-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632309#comment-17632309 ] Rok Mihevc commented on ARROW-18278: This works @kou! I'll open a PR for the docs. The only thing I had to do extra was install protobuf for aarch_64 as suggested by the error I pasted above: {code:bash} mvn install:install-file -DgroupId=com.google.protobuf -DartifactId=protoc -Dversion=3.20.3 -Dclassifier=osx-aarch_64 -Dpackaging=exe -Dfile=/path/to/file {code} I wonder if that can be automated somehow. > [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error > - > > Key: ARROW-18278 > URL: https://issues.apache.org/jira/browse/ARROW-18278 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Rok Mihevc >Priority: Major > > When building with maven on M1 [as per > docs|https://arrow.apache.org/docs/dev/developers/java/building.html#id3]: > {code:bash} > mvn clean install > mvn generate-resources -Pgenerate-libs-jni-macos-linux -N > {code} > I get the following error: > {code:bash} > [INFO] --- exec-maven-plugin:3.1.0:exec (jni-cmake) @ arrow-java-root --- > -- Building using CMake version: 3.24.2 > -- The C compiler identification is AppleClang 14.0.0.1429 > -- The CXX compiler identification is AppleClang 14.0.0.1429 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: > /Library/Developer/CommandLineTools/usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: > /Library/Developer/CommandLineTools/usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Found Java: > /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/bin/java (found > version "11.0.16") > -- Found JNI: > 
/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/include found > components: AWT JVM > CMake Error at dataset/CMakeLists.txt:18 (find_package): > By not providing "FindArrowDataset.cmake" in CMAKE_MODULE_PATH this project > has asked CMake to find a package configuration file provided by > "ArrowDataset", but CMake did not find one. > Could not find a package configuration file provided by "ArrowDataset" with > any of the following names: > ArrowDatasetConfig.cmake > arrowdataset-config.cmake > Add the installation prefix of "ArrowDataset" to CMAKE_PREFIX_PATH or set > "ArrowDataset_DIR" to a directory containing one of the above files. If > "ArrowDataset" provides a separate development package or SDK, be sure it > has been installed. > -- Configuring incomplete, errors occurred! > See also > "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeOutput.log". > See also > "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeError.log". > [ERROR] Command execution failed. 
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 > (Exit value: 1) > at org.apache.commons.exec.DefaultExecutor.executeInternal > (DefaultExecutor.java:404) > at org.apache.commons.exec.DefaultExecutor.execute > (DefaultExecutor.java:166) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:1000) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:947) > at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:471) > at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo > (DefaultBuildPluginManager.java:137) > at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 > (MojoExecutor.java:370) > at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute > (MojoExecutor.java:351) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:215) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:171) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:163) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject > (LifecycleModuleBuilder.java:117) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject > (LifecycleModuleBuilder.java:81) > at > org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build > (SingleThreadedBuilder.java:56) > at org.apache.maven.lifecycle.internal.LifecycleStarter.execute > (LifecycleStarter.java:128) > at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:294) > at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192) > at org.apache.maven.DefaultMave
[jira] [Comment Edited] (ARROW-16340) [C++][Python] Move all Python related code into PyArrow
[ https://issues.apache.org/jira/browse/ARROW-16340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632275#comment-17632275 ] Yue Ni edited comment on ARROW-16340 at 11/11/22 11:52 AM: --- > Does it mean that you use Apache Arrow C++ from vcpkg and pyarrow wheel from > PyPI? Almost. I use Apache Arrow C++ from vcpkg (but I don't use the latest version of Arrow in vcpkg, instead, I use a fork of it with some gandiva related modification, and use a custom vcpkg port to manage the arrow dependency). > If so, you should not use Apache Arrow C++ from vcpkg. Could you briefly explain why this should not be done this way? was (Author: niyue): > Does it mean that you use Apache Arrow C++ from vcpkg and pyarrow wheel from > PyPI? Almost. I use Apache Arrow C++ from vcpkg (but I don't use the latest version of Arrow in vcpkg, instead, I use a fork of it with some gandiva related modification, and use a custom vcpkg port to manage the arrow dependency). > If so, you should not use Apache Arrow C++ from vcpkg. Could you briefly explain why this should not be done this way? > [C++][Python] Move all Python related code into PyArrow > --- > > Key: ARROW-16340 > URL: https://issues.apache.org/jira/browse/ARROW-16340 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Alenka Frim >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 33h 10m > Remaining Estimate: 0h > > Move {{src/arrow/python}} directory into {{pyarrow}} and arrange PyArrow to > build it. > More details can be found on this thread: > https://lists.apache.org/thread/jbxyldhqff4p9z53whhs95y4jcomdgd2 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16340) [C++][Python] Move all Python related code into PyArrow
[ https://issues.apache.org/jira/browse/ARROW-16340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632275#comment-17632275 ] Yue Ni commented on ARROW-16340: > Does it mean that you use Apache Arrow C++ from vcpkg and pyarrow wheel from > PyPI? Almost. I use Apache Arrow C++ from vcpkg (but I don't use the latest version of Arrow in vcpkg, instead, I use a fork of it with some gandiva related modification, and use a custom vcpkg port to manage the arrow dependency). > If so, you should not use Apache Arrow C++ from vcpkg. Could you briefly explain why this should not be done this way? > [C++][Python] Move all Python related code into PyArrow > --- > > Key: ARROW-16340 > URL: https://issues.apache.org/jira/browse/ARROW-16340 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Alenka Frim >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 33h 10m > Remaining Estimate: 0h > > Move {{src/arrow/python}} directory into {{pyarrow}} and arrange PyArrow to > build it. > More details can be found on this thread: > https://lists.apache.org/thread/jbxyldhqff4p9z53whhs95y4jcomdgd2 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?
[ https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632256#comment-17632256 ] Jacek Pliszka edited comment on ARROW-15474 at 11/11/22 11:04 AM: -- [~westonpace] maybe an approach similar to what I proposed, but in a better version, would work? We need a compute function that, for a given array of values, returns the index of the first/last appearance. Then all batches can be processed in parallel and at the end merged exactly as you described. Once we have the index of the first/last appearance, we can use compute.take to produce the output table. Maybe an ordering function can even be specified so there would be no need to sort the array a priori. was (Author: jacek.pliszka): [~westonpace] maybe an approach similar to what I proposed, but in a better version, would work? We need a compute function that, for a given array of values, returns the index of the first/last appearance. Then all batches can be processed in parallel and at the end merged exactly as you described. Once we have the index of the first/last appearance, we can use compute.take to produce the output table. > [Python] Possibility of a table.drop_duplicates() function? > --- > > Key: ARROW-15474 > URL: https://issues.apache.org/jira/browse/ARROW-15474 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 6.0.1 >Reporter: Lance Dacey >Priority: Major > > I noticed that there is a group_by() and sort_by() function in the 7.0.0 > branch. Is it possible to include a drop_duplicates() function as well? > ||id||updated_at|| > |1|2022-01-01 04:23:57| > |2|2022-01-01 07:19:21| > |2|2022-01-10 22:14:01| > Something like this which would return a table without the second row in the > example above would be great. > I usually am reading an append-only dataset and then I need to report on > latest version of each row. 
To drop duplicates, I am temporarily converting > the append-only table to a pandas DataFrame, and then I convert it back to a > table and save a separate "latest-version" dataset. > {code:python} > table.sort_by(sorting=[("id", "ascending"), ("updated_at", > "ascending")]).drop_duplicates(subset=["id"], keep="last") > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?
[ https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632256#comment-17632256 ] Jacek Pliszka commented on ARROW-15474: --- [~westonpace] maybe an approach similar to what I proposed, but in a better version, would work? We need a compute function that, for a given array of values, returns the index of the first/last appearance. Then all batches can be processed in parallel and at the end merged exactly as you described. Once we have the index of the first/last appearance, we can use compute.take to produce the output table. > [Python] Possibility of a table.drop_duplicates() function? > --- > > Key: ARROW-15474 > URL: https://issues.apache.org/jira/browse/ARROW-15474 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 6.0.1 >Reporter: Lance Dacey >Priority: Major > > I noticed that there is a group_by() and sort_by() function in the 7.0.0 > branch. Is it possible to include a drop_duplicates() function as well? > ||id||updated_at|| > |1|2022-01-01 04:23:57| > |2|2022-01-01 07:19:21| > |2|2022-01-10 22:14:01| > Something like this which would return a table without the second row in the > example above would be great. > I usually am reading an append-only dataset and then I need to report on > latest version of each row. To drop duplicates, I am temporarily converting > the append-only table to a pandas DataFrame, and then I convert it back to a > table and save a separate "latest-version" dataset. > {code:python} > table.sort_by(sorting=[("id", "ascending"), ("updated_at", > "ascending")]).drop_duplicates(subset=["id"], keep="last") > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18313) Issues with open_dataset()
N Gautam Animesh created ARROW-18313: Summary: Issues with open_dataset() Key: ARROW-18313 URL: https://issues.apache.org/jira/browse/ARROW-18313 Project: Apache Arrow Issue Type: Bug Reporter: N Gautam Animesh Attachments: image-2022-11-11-09-19-16-065.png Calling open_dataset() creates a connection that blocks the files in the directory, so we cannot perform other operations on them, such as replace. Actual issue: # We are running an atomic operation on a bunch of files, which renames the temp files to the target file names. # While this is happening, if we run open_dataset() on that directory, the atomic operation fails and both target files and temp files are left in the directory. # The files that have been read through open_dataset() are blocked. # Please provide more detail on how we can handle such problems. # Snapshot: !image-2022-11-11-09-19-16-065.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-18310) [C++] Use atomic backpressure counter
[ https://issues.apache.org/jira/browse/ARROW-18310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaron Gvili reassigned ARROW-18310: --- Assignee: Yaron Gvili > [C++] Use atomic backpressure counter > - > > Key: ARROW-18310 > URL: https://issues.apache.org/jira/browse/ARROW-18310 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yaron Gvili >Assignee: Yaron Gvili >Priority: Major > > There are a few places in the code (sink_node.cc, source_node.cc, > file_base.cc) where the backpressure counter is of type `int32_t`. This > prevents `ExecNode::Pause(...)` and `ExecNode::Resume(...)` from being > thread-safe. The proposal is to make these backpressure counters be of type > `std::atomic<int32_t>`. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18310) [C++] Use atomic backpressure counter
[ https://issues.apache.org/jira/browse/ARROW-18310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18310: --- Labels: pull-request-available (was: ) > [C++] Use atomic backpressure counter > - > > Key: ARROW-18310 > URL: https://issues.apache.org/jira/browse/ARROW-18310 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yaron Gvili >Assignee: Yaron Gvili >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > There are a few places in the code (sink_node.cc, source_node.cc, > file_base.cc) where the backpressure counter is of type `int32_t`. This > prevents `ExecNode::Pause(...)` and `ExecNode::Resume(...)` from being > thread-safe. The proposal is to make these backpressure counters be of type > `std::atomic<int32_t>`. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18312) [C++] Optimize output sizes in segmented aggregation
Yaron Gvili created ARROW-18312: --- Summary: [C++] Optimize output sizes in segmented aggregation Key: ARROW-18312 URL: https://issues.apache.org/jira/browse/ARROW-18312 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Yaron Gvili This is a [follow-up task|https://github.com/apache/arrow/pull/14352#discussion_r1019661909] for a currently pending PR. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18311) [C++] Add `Grouper::Reset`
Yaron Gvili created ARROW-18311: --- Summary: [C++] Add `Grouper::Reset` Key: ARROW-18311 URL: https://issues.apache.org/jira/browse/ARROW-18311 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Yaron Gvili Adding `Grouper::Reset` will enable it to be reused in segmented streaming. See [this post|https://github.com/apache/arrow/pull/14352#discussion_r1016640969]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18310) [C++] Use atomic backpressure counter
Yaron Gvili created ARROW-18310: --- Summary: [C++] Use atomic backpressure counter Key: ARROW-18310 URL: https://issues.apache.org/jira/browse/ARROW-18310 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Yaron Gvili There are a few places in the code (sink_node.cc, source_node.cc, file_base.cc) where the backpressure counter is of type `int32_t`. This prevents `ExecNode::Pause(...)` and `ExecNode::Resume(...)` from being thread-safe. The proposal is to make these backpressure counters be of type `std::atomic<int32_t>`. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18269) [C++] Slash character in partition value handling
[ https://issues.apache.org/jira/browse/ARROW-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632152#comment-17632152 ] Vibhatha Lakmal Abeykoon commented on ARROW-18269: -- [~westonpace] The context here is that the partition column data is used to formulate the save directory path. When there is a '/' in the data, the value is implicitly treated as containing a separator when we form the directory path, so `A/Z` creates an `A` folder with `Z` inside it. I'm not sure we can remove that part or ask the code to ignore it. But on the reading side, when we recreate the fragments, we could decide whether to treat the segment as a path or as a single value. If we treat it as a path (which is what happens at the moment), we get the erroneous output; if we treat it as a plain value instead, we can retrieve it accurately. This is one viable option. If we do that, we can provide a lambda or flag to control this behavior; I think a function that determines the key decoding from the file path would be better. Is this overly complicated or a non-generic solution? I am inclined towards option 1 rather than option 2. Option 2 is straightforward to do, but a case like the one above could be very common. How is the URL encoding/decoding part relevant here? Am I missing something? Could you please clarify a bit? 
> [C++] Slash character in partition value handling > - > > Key: ARROW-18269 > URL: https://issues.apache.org/jira/browse/ARROW-18269 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 10.0.0 >Reporter: Vadym Dytyniak >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: good-first-issue > > > Provided example shows that pyarrow does not handle partition value that > contains '/' correctly: > {code:python} > import pandas as pd > import pyarrow as pa > from pyarrow import dataset as ds > df = pd.DataFrame({ > 'value': [1, 2], > 'instrument_id': ['A/Z', 'B'], > }) > ds.write_dataset( > data=pa.Table.from_pandas(df), > base_dir='data', > format='parquet', > partitioning=['instrument_id'], > partitioning_flavor='hive', > ) > table = ds.dataset( > source='data', > format='parquet', > partitioning='hive', > ).to_table() > tables = [table] > df = pa.concat_tables(tables).to_pandas() > print(df.head()){code} > Result: > {code} > value instrument_id > 0 1 A > 1 2 B {code} > Expected behaviour: > Option 1: Result should be: > {code} > value instrument_id > 0 1 A/Z > 1 2 B {code} > Option 2: Error should be raised to avoid '/' in partition value. > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
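On the URL encoding/decoding question raised above: percent-encoding is one way a '/' inside a partition value could survive the round trip. A minimal stdlib sketch of the idea; the path-segment handling here is illustrative, not Arrow's actual write/read code:

```python
import urllib.parse

# Percent-encode the partition value so '/' cannot be mistaken for a
# directory separator when the hive-style path segment is formed.
value = "A/Z"
encoded = urllib.parse.quote(value, safe="")
segment = f"instrument_id={encoded}"
print(segment)  # instrument_id=A%2FZ

# On the read side, decoding the segment recovers the value exactly.
key, raw = segment.split("=", 1)
assert urllib.parse.unquote(raw) == value
```

This corresponds to option 1 (the value round-trips as `A/Z`); option 2 would instead reject such values at write time.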
[jira] [Commented] (ARROW-18272) [pyarrow] ParquetFile does not recognize GCS cloud path as a string
[ https://issues.apache.org/jira/browse/ARROW-18272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632147#comment-17632147 ] Zepu Zhang commented on ARROW-18272: Yes, I'm making it work that way for now. Actually I have a case where I don't want the convenience of passing a str to it, because I'm processing a large number of files, and I don't want it to do the default credential inference for each file. So I do this: {code:python} gcs = pyarrow.fs.GcsFileSystem(token=..., credential_token_expiration=...) parquet_file = pyarrow.parquet.ParquetFile(gcs.open_input_file('mybucket/abc/d.parquet')) {code} However, API consistency and function signatures suggest `ParquetFile` and `read_metadata` should accept the same types of `where`. > [pyarrow] ParquetFile does not recognize GCS cloud path as a string > --- > > Key: ARROW-18272 > URL: https://issues.apache.org/jira/browse/ARROW-18272 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 10.0.0 >Reporter: Zepu Zhang >Priority: Minor > > I have a Parquet file at > > path = 'gs://mybucket/abc/d.parquet' > > `pyarrow.parquet.read_metadata(path)` works fine. > > `pyarrow.parquet.ParquetFile(path)` raises "Failed to open local file > 'gs://mybucket/abc/d.parquet'. > > Looks like ParquetFile misses the path resolution logic found in > `read_metadata` -- This message was sent by Atlassian Jira (v8.20.10#820010)