[jira] [Comment Edited] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters

2022-11-11 Thread Vibhatha Lakmal Abeykoon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632598#comment-17632598
 ] 

Vibhatha Lakmal Abeykoon edited comment on ARROW-15716 at 11/12/22 3:50 AM:


Yeah, that is true since it is always the equality operator, but it won't be for 
other comparison operators. So it is better to fuse the expressions at parse 
time rather than at to_table(). 

For example, when filtering a range of values as follows:
{code:python}
import pathlib
import tempfile

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

# stand-in for the pytest `tempdir` fixture used in the original snippet
tempdir = pathlib.Path(tempfile.mkdtemp())

df = pd.DataFrame({'a': [1, 2, 1, 2, 3, 4, 5, 1, 2, 4, 7, 8],
                   'b': [10, 30, 20, 40, 50, 60, 30, 50, 60, 10, 11, 12]})
table = pa.Table.from_pandas(df)
path = tempdir / 'partitioning'

collector = []
ds.write_dataset(
    table,
    base_dir=path,
    format="parquet",
    partitioning=["a"],
    partitioning_flavor="hive",
    file_visitor=lambda x: collector.append(x)
)

paths = [file.path for file in collector]
partitioning = ds.partitioning(flavor="hive")

dataset = ds.dataset(source=path, partitioning=partitioning)

filter_expressions = [dataset.partitioning.parse(path) for path in paths]

f11 = ds.field("a") > pc.scalar(3)
f22 = ds.field("a") < pc.scalar(6)
f3 = f11 & f22
print(f3)
new_table = dataset.to_table(filter=f3)
print(table.to_pandas())
print("-" * 80)
print(new_table.to_pandas())
{code}
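
One way to get the fused behavior today is to OR the parsed per-fragment expressions together before querying. A minimal sketch, assuming the variables from the snippet above:
{code:python}
import functools
import operator

# Fuse the per-fragment equality expressions (a == 1, a == 2, ...) into one
# filter expression, then query once.
fused_filter = functools.reduce(operator.or_, filter_expressions)
fused_table = dataset.to_table(filter=fused_filter)
{code}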



> [Dataset][Python] Parse a list of fragment paths to gather filters
> --
>
> Key: ARROW-15716
> URL: https://issues.apache.org/jira/browse/ARROW-15716
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 7.0.0
>Reporter: Lance Dacey
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Minor
>
> Is it possible for partitioning.parse() to be updated to parse a list of 
> paths instead of just a single path? 
> I am passing the .paths from file_visitor to downstream tasks to process data 
> which was recently saved, but I can run into problems with this if I 
> overwrite data with delete_matching in order to consolidate small files since 
> the paths won't exist. 
> Here is the output of my current approach to use filters instead of reading 
> the paths directly:
> {code:python}
> # Fragments saved during write_dataset 
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-0.parquet']
> # Run partitioning.parse() on each fragment 
> [<pyarrow.compute.Expression (date_id == 20210813)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>]
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
> ], skip_nulls=false})
> {code}
> My hope would be to do something like filt_exp = partitioning.parse(paths) 
> which would return a dataset expression.
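
For reference, a rough sketch of what such a helper could look like, hand-parsing the hive-style keys out of the fragment paths into a single is_in expression (the helper name and the parsing are illustrative only, not an existing API):
{code:python}
import pyarrow.dataset as ds

def parse_paths_to_filter(paths, key="date_id"):
    # Hypothetical helper: collect the partition values of `key` from
    # hive-style fragment paths and build one is_in filter from them.
    values = sorted({int(p.split(key + "=")[1].split("/")[0]) for p in paths})
    return ds.field(key).isin(values)

filt_exp = parse_paths_to_filter([
    'dev/dataset/fragments/date_id=20210813/data-0.parquet',
    'dev/dataset/fragments/date_id=20210114/data-0.parquet',
])  # is_in(date_id, {value_set=int64:[20210114, 20210813], ...})
{code}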



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16340) [C++][Python] Move all Python related code into PyArrow

2022-11-11 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632551#comment-17632551
 ] 

Kouhei Sutou commented on ARROW-16340:
--

Because the pyarrow wheel includes a pre-built Apache Arrow C++ library. If you 
use both Apache Arrow C++ from vcpkg and the pyarrow wheel from PyPI, you mix 
multiple Apache Arrow C++ libraries, which causes unexpected behavior such as 
crashes.

> [C++][Python] Move all Python related code into PyArrow
> ---
>
> Key: ARROW-16340
> URL: https://issues.apache.org/jira/browse/ARROW-16340
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 33h 10m
>  Remaining Estimate: 0h
>
> Move {{src/arrow/python}} directory into {{pyarrow}} and arrange PyArrow to 
> build it.
> More details can be found on this thread:
> https://lists.apache.org/thread/jbxyldhqff4p9z53whhs95y4jcomdgd2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18314) "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes CPP11::unwind_execption, crashed R

2022-11-11 Thread Lucas Mation (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Mation updated ARROW-18314:
-
Description: 
This is running on a Windows environment, arrow 10.0.0 (see arrow_info() below).

I issued two calls:

```

library(arrow)   # open_dataset, collect
library(dplyr)   # filter, %>%
library(tictoc)  # tic, toc

ft <- path_to_dataset1  # placeholder paths for two datasets
fa <- path_to_dataset2

#1)

tic()
d2 <- ft %>% open_dataset %>% filter(pis %in% mypis) %>% collect
toc()
927.11 sec elapsed

# returned a dataset with 44 obs, 38 columns; took an abnormally long time (16 min)

#2)

tic()
d3 <- fa %>% open_dataset %>% filter(pis %in% mypis) %>% collect
terminate called after throwing an instance of 'cpp11::unwind_exception'

```

Then I got an error that crashpad_handler.exe stopped working, and R froze; 
after a while R crashed too.

!image-2022-11-11-14-59-30-132.png!

 

arrow_info()
Arrow package version: 10.0.0

Capabilities:
               
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():
                       
arrow.use_threads FALSE

Memory:
                  
Allocator mimalloc
Current    0 bytes
Max        0 bytes

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                                                             
C++ Library Version                                    10.0.0
C++ Compiler                                              GNU
C++ Compiler Version                                   10.3.0
Git ID               aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0

 

 

 


> "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes  
> CPP11::unwind_execption, crashed R
> --
>
> Key: ARROW-18314
> URL: https://issues.apache.org/jira/browse/ARROW-18314
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Lucas Mation
>Priority: Major
> Attachments: image-2022-11-11-14-55-36-430.png, 
> image-2022-11-11-14-59-30-132.png
>
>
> This is running on a windows environment, arrow 10.0.0 (see arrow_info() 
> below)
> I issued two calls
> ```
> ft <- path_to_dataset1
> fa <- path_to_dataset2
> #1)
> tic()
> d2 <- ft %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
> toc()
> 927.11 sec elapsed
> #returned a dataset with 44 obs, 38 columns, took abnormal time, 16min
> #2)
> tic()
> d3 <- fa %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
> terminate called after throwing an instance of 'cpp11::unwind_exception'
> ```
> Then I got an error that crashpad_handler.exe stopped working, and R froze; 
> after a while R crashed too.
> !image-2022-11-11-14-59-30-132.png!
>  
> arrow_info()
> Arrow package version: 10.0.0
> Capabilities:
>                
> dataset    TRUE
> substrait FALSE
> parquet    TRUE
> json       TRUE
> s3         TRUE
> gcs        TRUE
> utf8proc   TRUE
> re2        TRUE
> snappy     TRUE
> gzip       TRUE
> brotli     TRUE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2        TRUE
> jemalloc  FALSE
> mimalloc   TRUE
> Arrow options():
>                        
> arrow.use_threads FALSE
> Memory:
>                   
> Allocator mimalloc
> Current    0 bytes
> Max        0 bytes
> Runtime:
>                         
> SIMD Level          avx2
> Detected SIMD Level avx2
> Build:
>                                                              
> C++ Library Version                                    10.0.0
> C++ Compiler                                              GNU
> C++ Compiler Version                                   10.3.0
> Git ID               aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18314) "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes CPP11::unwind_execption, crashed R

2022-11-11 Thread Lucas Mation (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Mation updated ARROW-18314:
-
Summary: "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes  
CPP11::unwind_execption, crashed R  (was: "open_dataset(f) %>% filder(id %in% 
myvec) %>% collect" causes  CPP11::unwind_execption, crashed R)

> "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes  
> CPP11::unwind_execption, crashed R
> --
>
> Key: ARROW-18314
> URL: https://issues.apache.org/jira/browse/ARROW-18314
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Lucas Mation
>Priority: Major
> Attachments: image-2022-11-11-14-55-36-430.png, 
> image-2022-11-11-14-59-30-132.png
>
>
> I issued two calls
> ```
> ft <- path_to_dataset1
> fa <- path_to_dataset2
> tic()
> d2 <- ft %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
> toc()
> 927.11 sec elapsed
> #returned a dataset with 44 obs, 38 columns, took abnormal time, 16min
> ft <- paste0(p2,'/RAIS_operacional/vinc_1976_2001/parquet_temp')
> fa <- paste0(p2,'/RAIS_operacional/vinc_1976_2001/parquet')
> tic()
> d3 <- fa %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
> terminate called after throwing an instance of 'cpp11::unwind_exception'
> ```
> Then I got an error that crashpad_handler.exe stopped working, and R froze; 
> after a while R crashed too.
> !image-2022-11-11-14-59-30-132.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18314) "open_dataset(f) %>% filder(id %in% myvec) %>% collect" causes CPP11::unwind_execption, crashed R

2022-11-11 Thread Lucas Mation (Jira)
Lucas Mation created ARROW-18314:


 Summary: "open_dataset(f) %>% filder(id %in% myvec) %>% collect" 
causes  CPP11::unwind_execption, crashed R
 Key: ARROW-18314
 URL: https://issues.apache.org/jira/browse/ARROW-18314
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Lucas Mation
 Attachments: image-2022-11-11-14-55-36-430.png, 
image-2022-11-11-14-59-30-132.png

I issued two calls

```

library(arrow)   # open_dataset, collect
library(dplyr)   # filter, %>%
library(tictoc)  # tic, toc

ft <- path_to_dataset1
fa <- path_to_dataset2

tic()
d2 <- ft %>% open_dataset %>% filter(pis %in% mypis) %>% collect
toc()
927.11 sec elapsed

# returned a dataset with 44 obs, 38 columns; took an abnormally long time (16 min)

ft <- paste0(p2,'/RAIS_operacional/vinc_1976_2001/parquet_temp')
fa <- paste0(p2,'/RAIS_operacional/vinc_1976_2001/parquet')
tic()
d3 <- fa %>% open_dataset %>% filter(pis %in% mypis) %>% collect
terminate called after throwing an instance of 'cpp11::unwind_exception'

```

Then I got an error that crashpad_handler.exe stopped working, and R froze; 
after a while R crashed too.

!image-2022-11-11-14-59-30-132.png!

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16774) [C++] Create Filter Kernel on RLE data

2022-11-11 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-16774:
-

Assignee: (was: Tobias Zagorni)

> [C++] Create Filter Kernel on RLE data
> --
>
> Key: ARROW-16774
> URL: https://issues.apache.org/jira/browse/ARROW-16774
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17392) [C++] Disable anonymous namespaces in debug mode

2022-11-11 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-17392:
-

Assignee: (was: Sasha Krassovsky)

> [C++] Disable anonymous namespaces in debug mode
> 
>
> Key: ARROW-17392
> URL: https://issues.apache.org/jira/browse/ARROW-17392
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Sasha Krassovsky
>Priority: Major
>
> I've had some pain points when using GDB and the pervasive use of anonymous 
> namespaces throughout the code. I sent out an email on the mailing list and 
> no one seemed to have any opinions, so I am opening this task. This task will 
> gate anonymous namespaces around a `#ifndef NDEBUG` flag (or perhaps make a 
> RELEASE_MODE_ANONYMOUS_NAMESPACE macro of some sort).
>  
> Mailing list discussion: 
> https://lists.apache.org/thread/61rjzb18mvft7lpwglyh4kq2gkbog4ts



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16774) [C++] Create Filter Kernel on RLE data

2022-11-11 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632488#comment-17632488
 ] 

Apache Arrow JIRA Bot commented on ARROW-16774:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++] Create Filter Kernel on RLE data
> --
>
> Key: ARROW-16774
> URL: https://issues.apache.org/jira/browse/ARROW-16774
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17392) [C++] Disable anonymous namespaces in debug mode

2022-11-11 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632487#comment-17632487
 ] 

Apache Arrow JIRA Bot commented on ARROW-17392:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++] Disable anonymous namespaces in debug mode
> 
>
> Key: ARROW-17392
> URL: https://issues.apache.org/jira/browse/ARROW-17392
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Sasha Krassovsky
>Assignee: Sasha Krassovsky
>Priority: Major
>
> I've had some pain points when using GDB and the pervasive use of anonymous 
> namespaces throughout the code. I sent out an email on the mailing list and 
> no one seemed to have any opinions, so I am opening this task. This task will 
> gate anonymous namespaces around a `#ifndef NDEBUG` flag (or perhaps make a 
> RELEASE_MODE_ANONYMOUS_NAMESPACE macro of some sort).
>  
> Mailing list discussion: 
> https://lists.apache.org/thread/61rjzb18mvft7lpwglyh4kq2gkbog4ts



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16680) [R] Weird R error: Error in fs___FileSystem__GetTargetInfos_FileSelector(self, x) : ignoring SIGPIPE signal

2022-11-11 Thread Carl Boettiger (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632484#comment-17632484
 ] 

Carl Boettiger commented on ARROW-16680:


Wow, thanks Dewey! That looks like black magic to me, but I can definitely 
confirm that it works!

 

I'm still a bit stuck on the right thing to do in cases where we provide 
user-facing packages that rely on arrow functions to access large external 
data. Like you say, I don't mind doing this in my own scripts, but it seems 
poor form to invisibly impose it on users, where it may have side effects on 
their other work.

> [R] Weird R error: Error in 
> fs___FileSystem__GetTargetInfos_FileSelector(self, x) :ignoring SIGPIPE 
> signal
> --
>
> Key: ARROW-16680
> URL: https://issues.apache.org/jira/browse/ARROW-16680
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Carl Boettiger
>Priority: Major
>
> Okay, apologies, this is a bit of a weird error, but it is annoying the heck out 
> of me. The following block of R code, when run with Rscript (or embedded 
> into any form of Rmd, quarto, or knitr doc), produces the error below (at least 
> most of the time):
>  
> {code:java}
> library(arrow)
> library(dplyr)
> 
> Sys.setenv(AWS_EC2_METADATA_DISABLED = "TRUE")
> Sys.unsetenv("AWS_ACCESS_KEY_ID")
> Sys.unsetenv("AWS_SECRET_ACCESS_KEY")
> Sys.unsetenv("AWS_DEFAULT_REGION")
> Sys.unsetenv("AWS_S3_ENDPOINT")
> 
> s3 <- arrow::s3_bucket(bucket = "scores/parquet",
>                        endpoint_override = "data.ecoforecast.org")
> ds <- arrow::open_dataset(s3, partitioning = c("theme", "year"))
> ds |> dplyr::filter(theme == "phenology") |> dplyr::collect()
> {code}
> Gives the error
>  
>  
> {code:java}
> Error in fs___FileSystem__GetTargetInfos_FileSelector(self, x) : 
>   ignoring SIGPIPE signal
> Calls: %>% ...  -> fs___FileSystem__GetTargetInfos_FileSelector 
> {code}
> But only when run as a script! When run interactively in an R console, this 
> code runs just fine. Even as a script the code seems to run fine, but it 
> erroneously attempts this SIGPIPE, which I don't understand.
> If the script is executed with littler 
> ([https://dirk.eddelbuettel.com/code/littler.html]) then it runs fine, since 
> littler handles SIGPIPE but Rscript doesn't. But I have no idea why the above 
> code throws a SIGPIPE in the first place. Worse, if I choose a different filter 
> for the above, like "aquatics", it (usually) works without the error.
> I have no idea why `fs___FileSystem__GetTargetInfos_FileSelector` results in 
> this, but I would really appreciate any hints on how to avoid it, as it makes 
> it very hard to use arrow in workflows right now!
>  
> thanks for all you do!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18307) [C++] Read list/array data from ChunkedArray with multiple chunks

2022-11-11 Thread Arthur Passos (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arthur Passos updated ARROW-18307:
--
Description: 
I am reading a parquet file with arrow::RecordBatchReader, and the arrow::Table 
returned contains columns with multiple chunks (column->num_chunks() > 1). The 
column in question (though the issue is not limited to it) is of type Array(Int64).

 

I want to convert this arrow column into an internal structure that contains a 
contiguous chunk of memory for the data and a vector of offsets, very similar 
to arrow's structure. The code I have so far works in two "phases":

1. Get the nested arrow column data. In this case, get the Int64 data out of 
Array(Int64).
2. Get the offsets from Array(Int64).

To achieve #1, I am looping over the chunks and storing 
arrow::Array::values into a new arrow::ChunkedArray.

 
{code:java}
static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        // grab the child values of each list chunk
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}{code}
This does not work as expected, though. Even though there are multiple chunks, the 
arrow::Array::values method returns the very same buffer for all of them, which 
ends up duplicating the data on my side. One pattern I noticed is that if I 
read only the Array(Int64) column, I get only one chunk. If I read both 
columns, I get two chunks. It looks like all columns will, inevitably, have the 
same number of chunks, even though their buffers are not chunked accordingly.

I then looked through more examples and came across the [ColumnarTableToVector 
example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121].
 It looks like this example assumes there is only one chunk and ignores the 
possibility of it having multiple chunks. It's probably just a detail, and the 
test wasn't actually intended to cover multiple chunks.

I managed to get the expected output doing something like the below:
{code:java}
auto & list_chunk1 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & list_chunk2 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(1)));

auto l1_offset = *list_chunk1.raw_value_offsets();
auto l2_offset = *list_chunk2.raw_value_offsets();

auto l1_end_offset = list_chunk1.value_offset(list_chunk1.data()->length);
auto l2_end_offset = list_chunk2.value_offset(list_chunk2.data()->length);

auto lcv1 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(0))).values()->SliceSafe(l1_offset, l1_end_offset - l1_offset).ValueOrDie();
auto lcv2 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(1))).values()->SliceSafe(l2_offset, l2_end_offset - l2_offset).ValueOrDie();{code}
This looks too hackish and I feel like there is a much better way.

Hence, my question: How do I properly extract the data & offsets out of such 
column? A more generic version of this is: how to extract the data out of 
ChunkedArrays with multiple chunks?
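
As an aside, the same extraction is easier to see through the Python bindings; a sketch of the idea (the C++ counterparts are arrow::Concatenate and arrow::ListArray::Flatten, which take each chunk's offsets into account):
{code:python}
import pyarrow as pa

chunked = pa.chunked_array(
    [[[1, 2], [3]], [[4, 5, 6]]],
    type=pa.list_(pa.int64()),
)

combined = pa.concat_arrays(chunked.chunks)  # one contiguous ListArray
values = combined.flatten()                  # child values, honoring offsets
offsets = combined.offsets                   # offsets into the child array
{code}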


[jira] [Updated] (ARROW-18278) [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error

2022-11-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18278:
---
Labels: pull-request-available  (was: )

> [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error
> -
>
> Key: ARROW-18278
> URL: https://issues.apache.org/jira/browse/ARROW-18278
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When building with maven on M1 [as per 
> docs|https://arrow.apache.org/docs/dev/developers/java/building.html#id3]:
> {code:bash}
> mvn clean install
> mvn generate-resources -Pgenerate-libs-jni-macos-linux -N
> {code}
> I get the following error:
> {code:bash}
> [INFO] --- exec-maven-plugin:3.1.0:exec (jni-cmake) @ arrow-java-root ---
> -- Building using CMake version: 3.24.2
> -- The C compiler identification is AppleClang 14.0.0.1429
> -- The CXX compiler identification is AppleClang 14.0.0.1429
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Check for working C compiler: 
> /Library/Developer/CommandLineTools/usr/bin/cc - skipped
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Check for working CXX compiler: 
> /Library/Developer/CommandLineTools/usr/bin/c++ - skipped
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Found Java: 
> /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/bin/java (found 
> version "11.0.16") 
> -- Found JNI: 
> /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/include  found 
> components: AWT JVM 
> CMake Error at dataset/CMakeLists.txt:18 (find_package):
>   By not providing "FindArrowDataset.cmake" in CMAKE_MODULE_PATH this project
>   has asked CMake to find a package configuration file provided by
>   "ArrowDataset", but CMake did not find one.
>   Could not find a package configuration file provided by "ArrowDataset" with
>   any of the following names:
> ArrowDatasetConfig.cmake
> arrowdataset-config.cmake
>   Add the installation prefix of "ArrowDataset" to CMAKE_PREFIX_PATH or set
>   "ArrowDataset_DIR" to a directory containing one of the above files.  If
>   "ArrowDataset" provides a separate development package or SDK, be sure it
>   has been installed.
> -- Configuring incomplete, errors occurred!
> See also 
> "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeOutput.log".
> See also 
> "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeError.log".
> [ERROR] Command execution failed.
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 
> (Exit value: 1)
> at org.apache.commons.exec.DefaultExecutor.executeInternal 
> (DefaultExecutor.java:404)
> at org.apache.commons.exec.DefaultExecutor.execute 
> (DefaultExecutor.java:166)
> at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:1000)
> at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:947)
> at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:471)
> at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo 
> (DefaultBuildPluginManager.java:137)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 
> (MojoExecutor.java:370)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute 
> (MojoExecutor.java:351)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:215)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:171)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:163)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:117)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:81)
> at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
>  (SingleThreadedBuilder.java:56)
> at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 
> (LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:294)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
> at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
> at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960)
> at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
> at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
> at jdk.inte

[jira] [Assigned] (ARROW-18278) [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error

2022-11-11 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc reassigned ARROW-18278:
--

Assignee: Rok Mihevc

> [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error
> -
>
> Key: ARROW-18278
> URL: https://issues.apache.org/jira/browse/ARROW-18278
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Major
>
> When building with maven on M1 [as per 
> docs|https://arrow.apache.org/docs/dev/developers/java/building.html#id3]:
> {code:bash}
> mvn clean install
> mvn generate-resources -Pgenerate-libs-jni-macos-linux -N
> {code}
> I get the following error:
> {code:bash}
> [INFO] --- exec-maven-plugin:3.1.0:exec (jni-cmake) @ arrow-java-root ---
> -- Building using CMake version: 3.24.2
> -- The C compiler identification is AppleClang 14.0.0.1429
> -- The CXX compiler identification is AppleClang 14.0.0.1429
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Check for working C compiler: 
> /Library/Developer/CommandLineTools/usr/bin/cc - skipped
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Check for working CXX compiler: 
> /Library/Developer/CommandLineTools/usr/bin/c++ - skipped
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Found Java: 
> /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/bin/java (found 
> version "11.0.16") 
> -- Found JNI: 
> /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/include  found 
> components: AWT JVM 
> CMake Error at dataset/CMakeLists.txt:18 (find_package):
>   By not providing "FindArrowDataset.cmake" in CMAKE_MODULE_PATH this project
>   has asked CMake to find a package configuration file provided by
>   "ArrowDataset", but CMake did not find one.
>   Could not find a package configuration file provided by "ArrowDataset" with
>   any of the following names:
> ArrowDatasetConfig.cmake
> arrowdataset-config.cmake
>   Add the installation prefix of "ArrowDataset" to CMAKE_PREFIX_PATH or set
>   "ArrowDataset_DIR" to a directory containing one of the above files.  If
>   "ArrowDataset" provides a separate development package or SDK, be sure it
>   has been installed.
> -- Configuring incomplete, errors occurred!
> See also 
> "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeOutput.log".
> See also 
> "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeError.log".
> [ERROR] Command execution failed.
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 
> (Exit value: 1)
> at org.apache.commons.exec.DefaultExecutor.executeInternal 
> (DefaultExecutor.java:404)
> at org.apache.commons.exec.DefaultExecutor.execute 
> (DefaultExecutor.java:166)
> at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:1000)
> at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:947)
> at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:471)
> at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo 
> (DefaultBuildPluginManager.java:137)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 
> (MojoExecutor.java:370)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute 
> (MojoExecutor.java:351)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:215)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:171)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:163)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:117)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:81)
> at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
>  (SingleThreadedBuilder.java:56)
> at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 
> (LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:294)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
> at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
> at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960)
> at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
> at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
> at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
> at jdk.internal.reflect.NativeMethodAccessorImpl.in

[jira] [Commented] (ARROW-18278) [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error

2022-11-11 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632309#comment-17632309
 ] 

Rok Mihevc commented on ARROW-18278:


This works @kou! I'll open a PR for the docs.
The only extra thing I had to do was install protobuf for aarch_64, as 
suggested by the error I pasted above:
{code:bash}
mvn install:install-file -DgroupId=com.google.protobuf -DartifactId=protoc 
-Dversion=3.20.3 -Dclassifier=osx-aarch_64 -Dpackaging=exe -Dfile=/path/to/file
{code}
I wonder if that can be automated somehow.


> [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error
> -
>
> Key: ARROW-18278
> URL: https://issues.apache.org/jira/browse/ARROW-18278
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Rok Mihevc
>Priority: Major
>
> When building with maven on M1 [as per 
> docs|https://arrow.apache.org/docs/dev/developers/java/building.html#id3]:
> {code:bash}
> mvn clean install
> mvn generate-resources -Pgenerate-libs-jni-macos-linux -N
> {code}
> I get the following error:
> {code:bash}
> [INFO] --- exec-maven-plugin:3.1.0:exec (jni-cmake) @ arrow-java-root ---
> -- Building using CMake version: 3.24.2
> -- The C compiler identification is AppleClang 14.0.0.1429
> -- The CXX compiler identification is AppleClang 14.0.0.1429
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Check for working C compiler: 
> /Library/Developer/CommandLineTools/usr/bin/cc - skipped
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Check for working CXX compiler: 
> /Library/Developer/CommandLineTools/usr/bin/c++ - skipped
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Found Java: 
> /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/bin/java (found 
> version "11.0.16") 
> -- Found JNI: 
> /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/include  found 
> components: AWT JVM 
> CMake Error at dataset/CMakeLists.txt:18 (find_package):
>   By not providing "FindArrowDataset.cmake" in CMAKE_MODULE_PATH this project
>   has asked CMake to find a package configuration file provided by
>   "ArrowDataset", but CMake did not find one.
>   Could not find a package configuration file provided by "ArrowDataset" with
>   any of the following names:
> ArrowDatasetConfig.cmake
> arrowdataset-config.cmake
>   Add the installation prefix of "ArrowDataset" to CMAKE_PREFIX_PATH or set
>   "ArrowDataset_DIR" to a directory containing one of the above files.  If
>   "ArrowDataset" provides a separate development package or SDK, be sure it
>   has been installed.
> -- Configuring incomplete, errors occurred!
> See also 
> "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeOutput.log".
> See also 
> "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeError.log".
> [ERROR] Command execution failed.
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 
> (Exit value: 1)
> at org.apache.commons.exec.DefaultExecutor.executeInternal 
> (DefaultExecutor.java:404)
> at org.apache.commons.exec.DefaultExecutor.execute 
> (DefaultExecutor.java:166)
> at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:1000)
> at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:947)
> at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:471)
> at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo 
> (DefaultBuildPluginManager.java:137)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 
> (MojoExecutor.java:370)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute 
> (MojoExecutor.java:351)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:215)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:171)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:163)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:117)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:81)
> at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
>  (SingleThreadedBuilder.java:56)
> at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 
> (LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:294)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
> at org.apache.maven.DefaultMave

[jira] [Comment Edited] (ARROW-16340) [C++][Python] Move all Python related code into PyArrow

2022-11-11 Thread Yue Ni (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632275#comment-17632275
 ] 

Yue Ni edited comment on ARROW-16340 at 11/11/22 11:52 AM:
---

> Does it mean that you use Apache Arrow C++ from vcpkg and pyarrow wheel from 
> PyPI?

Almost. I use Apache Arrow C++ from vcpkg, but not the latest version of Arrow 
in vcpkg; instead, I use a fork of it with some gandiva-related modifications, 
and a custom vcpkg port to manage the arrow dependency.

> If so, you should not use Apache Arrow C++ from vcpkg. 

Could you briefly explain why this should not be done this way?



> [C++][Python] Move all Python related code into PyArrow
> ---
>
> Key: ARROW-16340
> URL: https://issues.apache.org/jira/browse/ARROW-16340
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 33h 10m
>  Remaining Estimate: 0h
>
> Move {{src/arrow/python}} directory into {{pyarrow}} and arrange PyArrow to 
> build it.
> More details can be found on this thread:
> https://lists.apache.org/thread/jbxyldhqff4p9z53whhs95y4jcomdgd2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?

2022-11-11 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632256#comment-17632256
 ] 

Jacek Pliszka edited comment on ARROW-15474 at 11/11/22 11:04 AM:
--

[~westonpace] maybe an approach similar to what I proposed, but in a better 
version, would work?

We need a compute function that, for a given array of values, returns the index 
of the first/last appearance of each value.
Then all batches can be processed in parallel and merged at the end, exactly as 
you described.

Once we have the index of the first/last appearance, we can use compute.take to 
produce the output table; see the sketch below.

Maybe an ordering function could even be specified, so there would be no need 
to sort the array a priori.
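
For what it's worth, a minimal sketch of that idea on a single table, using the existing group_by aggregation to keep the first/last row index per key (the function name and details are hypothetical, not a proposed API):
{code:python}
import pyarrow as pa

def drop_duplicates(table, subset, keep="last"):
    # Tag each row with its index, keep the min/max index per key group,
    # then take those rows back out of the original table. Note that the
    # group order of the result is not guaranteed.
    indices = pa.array(range(table.num_rows), type=pa.int64())
    agg = "max" if keep == "last" else "min"
    grouped = (table.append_column("_row_index", indices)
                    .group_by(subset)
                    .aggregate([("_row_index", agg)]))
    return table.take(grouped.column("_row_index_" + agg))
{code}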





> [Python] Possibility of a table.drop_duplicates() function?
> ---
>
> Key: ARROW-15474
> URL: https://issues.apache.org/jira/browse/ARROW-15474
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 6.0.1
>Reporter: Lance Dacey
>Priority: Major
>
> I noticed that there is a group_by() and sort_by() function in the 7.0.0 
> branch. Is it possible to include a drop_duplicates() function as well? 
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this which would return a table without the second row in the 
> example above would be great. 
> I usually am reading an append-only dataset and then I need to report on 
> latest version of each row. To drop duplicates, I am temporarily converting 
> the append-only table to a pandas DataFrame, and then I convert it back to a 
> table and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(sorting=[("id", "ascending"), ("updated_at", 
> "ascending")]).drop_duplicates(subset=["id"] keep="last")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18313) Issues with open_dataset()

2022-11-11 Thread N Gautam Animesh (Jira)
N Gautam Animesh created ARROW-18313:


 Summary: Issues with open_dataset()
 Key: ARROW-18313
 URL: https://issues.apache.org/jira/browse/ARROW-18313
 Project: Apache Arrow
  Issue Type: Bug
Reporter: N Gautam Animesh
 Attachments: image-2022-11-11-09-19-16-065.png

Calling open_dataset() creates a connection that keeps the files in the 
directory blocked, so we cannot perform other operations on them, such as 
replace.

Actual issue:
 # We are running an atomic operation on a bunch of files, which renames the 
temp files to the target file names.
 # But while this is happening, if we try to run open_dataset() on that 
particular directory, the atomic operation fails, and both target files and 
temp files are left in the directory.
 # The files that have been read through open_dataset() are blocked.
 # Please provide guidance on how we can handle such problems.
 # Snapshot: !image-2022-11-11-09-19-16-065.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18310) [C++] Use atomic backpressure counter

2022-11-11 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili reassigned ARROW-18310:
---

Assignee: Yaron Gvili

> [C++] Use atomic backpressure counter
> -
>
> Key: ARROW-18310
> URL: https://issues.apache.org/jira/browse/ARROW-18310
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> There are a few places in the code (sink_node.cc, source_node.cc, 
> file_base.cc) where the backpressure counter is of type `int32_t`. This 
> prevents `ExecNode::Pause(...)` and `ExecNode::Resume(...)` from being 
> thread-safe. The proposal is to make these backpressure counters be of type 
> `std::atomic<int32_t>`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18310) [C++] Use atomic backpressure counter

2022-11-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18310:
---
Labels: pull-request-available  (was: )

> [C++] Use atomic backpressure counter
> -
>
> Key: ARROW-18310
> URL: https://issues.apache.org/jira/browse/ARROW-18310
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are a few places in the code (sink_node.cc, source_node.cc, 
> file_base.cc) where the backpressure counter is of type `int32_t`. This 
> prevents `ExecNode::Pause(...)` and `ExecNode::Resume(...)` from being 
> thread-safe. The proposal is to make these backpressure counters be of type 
> `std::atomic<int32_t>`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18312) [C++] Optimize output sizes in segmented aggregation

2022-11-11 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18312:
---

 Summary: [C++] Optimize output sizes in segmented aggregation
 Key: ARROW-18312
 URL: https://issues.apache.org/jira/browse/ARROW-18312
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili


This is a [follow-up 
task|https://github.com/apache/arrow/pull/14352#discussion_r1019661909] for a 
currently pending PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18311) [C++] Add `Grouper::Reset`

2022-11-11 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18311:
---

 Summary: [C++] Add `Grouper::Reset`
 Key: ARROW-18311
 URL: https://issues.apache.org/jira/browse/ARROW-18311
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili


Adding `Grouper::Reset` will enable it to be reused in segmented streaming. 
See [this 
post|https://github.com/apache/arrow/pull/14352#discussion_r1016640969].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18310) [C++] Use atomic backpressure counter

2022-11-11 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18310:
---

 Summary: [C++] Use atomic backpressure counter
 Key: ARROW-18310
 URL: https://issues.apache.org/jira/browse/ARROW-18310
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili


There are a few places in the code (sink_node.cc, source_node.cc, file_base.cc) 
where the backpressure counter is of type `int32_t`. This prevents 
`ExecNode::Pause(...)` and `ExecNode::Resume(...)` from being thread-safe. The 
proposal is to make these backpressure counters be of type 
`std::atomic<int32_t>`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18269) [C++] Slash character in partition value handling

2022-11-11 Thread Vibhatha Lakmal Abeykoon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632152#comment-17632152
 ] 

Vibhatha Lakmal Abeykoon commented on ARROW-18269:
--

[~westonpace] 

So here the context is that, the partition column data is being used to 
formulate the save directory path. When there is a '/' in data, this value get 
implicitly considered as a separator when we form the directory path. Thus 
`A/Z` makes a `A` folder and `Z` inside it. Not sure we can remove that part or 
ask the code to ignore it. 

But, in the reading part, when we recreate the fragments, we could decide 
whether to consider it as a path or just as a single value. If we consider it 
as a path (which is being done at the moment), we would get the erroneous 
output, but if we say don't consider it as a path, but as a non-path, we could 
retrieve the value accurately. 

This is one viable option. If we do that, we can provide a lamda or flag to 
determine this behavior. 

I think a function to determine the key decoding from the file path would be 
better. 

Is this overly complicated or a non-generic solution?

Although I am inclined towards option 1 and not option 2. Option 2 is pretty 
straightforward to do, but a case as mentioned above could be very common.

How is the URL encoding/decoding part relevant here? Am I missing something?

Could you please clarify a bit? 
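
For context, URL encoding would make the separator unambiguous by escaping it inside the value itself; a small illustration of the idea (this is not current pyarrow behavior):
{code:python}
import urllib.parse

value = "A/Z"
encoded = urllib.parse.quote(value, safe="")  # 'A%2FZ': safe as one directory name
decoded = urllib.parse.unquote(encoded)       # 'A/Z' recovered when reading back
{code}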

> [C++] Slash character in partition value handling
> -
>
> Key: ARROW-18269
> URL: https://issues.apache.org/jira/browse/ARROW-18269
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.0
>Reporter: Vadym Dytyniak
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: good-first-issue
>
>  
> Provided example shows that pyarrow does not handle partition value that 
> contains '/' correctly:
> {code:java}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import dataset as ds
> df = pd.DataFrame({
> 'value': [1, 2],
> 'instrument_id': ['A/Z', 'B'],
> })
> ds.write_dataset(
> data=pa.Table.from_pandas(df),
> base_dir='data',
> format='parquet',
> partitioning=['instrument_id'],
> partitioning_flavor='hive',
> )
> table = ds.dataset(
> source='data',
> format='parquet',
> partitioning='hive',
> ).to_table()
> tables = [table]
> df = pa.concat_tables(tables).to_pandas()
> print(df.head()){code}
> Result:
> {code:java}
>    value instrument_id
> 0      1             A
> 1      2             B {code}
> Expected behaviour:
> Option 1: Result should be:
> {code:java}
>    value instrument_id
> 0      1             A/Z
> 1      2             B {code}
> Option 2: Error should be raised to avoid '/' in partition value.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18272) [pyarrow] ParquetFile does not recognize GCS cloud path as a string

2022-11-11 Thread Zepu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632147#comment-17632147
 ] 

Zepu Zhang commented on ARROW-18272:


Yes, I'm making it work that way for now. Actually, I have a case where I don't 
want the convenience of passing a str to it: I'm processing a large number of 
files, and I don't want it to do the default credential inference for each 
file. So I do this:

```
gcs = pyarrow.fs.GcsFileSystem(token=..., credential_token_expiration=...)

parquet_file = pyarrow.parquet.ParquetFile(gcs.open_input_file('mybucket/abc/d.parquet'))
```

However, API consistency and function signatures suggest `ParquetFile` and 
`read_metadata` should accept the same types of `where`.
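
For reference, a sketch of the resolution logic in question, using the generic pyarrow.fs.FileSystem.from_uri to split a URI into a filesystem and a path once, then reusing that filesystem handle (credential inference then happens only at construction):

```
from pyarrow import fs
import pyarrow.parquet as pq

# resolve 'gs://...' into (GcsFileSystem, 'mybucket/abc/d.parquet') once
filesystem, path = fs.FileSystem.from_uri('gs://mybucket/abc/d.parquet')
parquet_file = pq.ParquetFile(filesystem.open_input_file(path))
```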

> [pyarrow] ParquetFile does not recognize GCS cloud path as a string
> ---
>
> Key: ARROW-18272
> URL: https://issues.apache.org/jira/browse/ARROW-18272
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.0
>Reporter: Zepu Zhang
>Priority: Minor
>
> I have a Parquet file at
>  
> path = 'gs://mybucket/abc/d.parquet'
>  
> `pyarrow.parquet.read_metadata(path)` works fine.
>  
> `pyarrow.parquet.ParquetFile(path)` raises "Failed to open local file 
> 'gs://mybucket/abc/d.parquet'.
>  
> Looks like ParquetFile misses the path resolution logic found in 
> `read_metadata`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)