[jira] [Updated] (ARROW-15470) [R] Allow user to specify string to be used for missing data when writing CSV dataset

2022-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-15470:
---
Labels: pull-request-available  (was: )

> [R] Allow user to specify string to be used for missing data when writing 
> CSV dataset
> --
>
> Key: ARROW-15470
> URL: https://issues.apache.org/jira/browse/ARROW-15470
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The ability to select the string used for missing data was implemented for 
> the CSV writer in ARROW-14903 and, as David Li points out below, is already 
> available, so I think we just need to hook it up on the R side.
> This requires the values passed in as the "na" argument to instead be passed 
> through to "null_strings", similar to what was done with "skip" and 
> "skip_rows" in ARROW-15743.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18111) [Go] Remaining Scalar Binary Arithmetic (bitwise, shifts)

2022-11-18 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol reassigned ARROW-18111:
-

Assignee: Matthew Topol

> [Go] Remaining Scalar Binary Arithmetic (bitwise, shifts)
> -
>
> Key: ARROW-18111
> URL: https://issues.apache.org/jira/browse/ARROW-18111
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16435) [Tools][Docs] Add instructions on how to collect the produced telemetry data

2022-11-18 Thread Bryce Mecum (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636029#comment-17636029
 ] 

Bryce Mecum commented on ARROW-16435:
-

That'd be great, thanks. I'll work something up in December.

> [Tools][Docs] Add instructions on how to collect the produced telemetry data
> 
>
> Key: ARROW-16435
> URL: https://issues.apache.org/jira/browse/ARROW-16435
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools, Documentation
>Reporter: Matthijs Brobbel
>Assignee: Bryce Mecum
>Priority: Minor
>
> With the ongoing efforts to produce telemetry data in the engine, it might be 
> helpful to collect some notes from the landed PRs about how to gather that 
> telemetry data using the otel-collector and Jaeger running in containers 
> managed by docker-compose.
> The goal is that these notes make it trivial to get started with the 
> telemetry data produced by the engine.
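As a sketch of what those notes might describe, a minimal docker-compose setup along these lines (image tags, ports, and the collector config path are illustrative assumptions, not taken from the landed PRs):

{code:yaml}
# docker-compose.yml (illustrative sketch)
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "14250:14250"  # collector -> Jaeger gRPC
  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"  # OTLP gRPC ingest from the engine
    depends_on:
      - jaeger
{code}

The engine would then be pointed at the collector's OTLP endpoint (e.g. localhost:4317), and traces inspected in the Jaeger UI.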



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16435) [Tools][Docs] Add instructions on how to collect the produced telemetry data

2022-11-18 Thread Bryce Mecum (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryce Mecum reassigned ARROW-16435:
---

Assignee: Bryce Mecum

> [Tools][Docs] Add instructions on how to collect the produced telemetry data
> 
>
> Key: ARROW-16435
> URL: https://issues.apache.org/jira/browse/ARROW-16435
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools, Documentation
>Reporter: Matthijs Brobbel
>Assignee: Bryce Mecum
>Priority: Minor
>
> With the ongoing efforts to produce telemetry data in the engine, it might be 
> helpful to collect some notes from the landed PRs about how to gather that 
> telemetry data using the otel-collector and Jaeger running in containers 
> managed by docker-compose.
> The goal is that these notes make it trivial to get started with the 
> telemetry data produced by the engine.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18345) [R] Create a CRAN-specific packaging checklist that lives in the R package directory

2022-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18345:
---
Labels: pull-request-available  (was: )

> [R] Create a CRAN-specific packaging checklist that lives in the R package 
> directory
> -
>
> Key: ARROW-18345
> URL: https://issues.apache.org/jira/browse/ARROW-18345
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Like other packaging tasks, the CRAN packaging task (which is concerned with 
> making sure the R package from the Arrow release complies with CRAN policies) 
> is slightly different from the overall Arrow release task for the R package. 
> For example, we often push patch-patch releases if the two-week window we get 
> to "safely retain the package on CRAN" does not line up with a release vote. 
> [~npr] has heroically been doing this for a long time, and while he has 
> equally heroically volunteered to keep doing it, I am hoping the process of 
> codifying this somewhere in the R repo will help a wider set of contributors 
> understand the process (even if it was already documented elsewhere!).
> [~stephhazlitt] and I use {{usethis::use_release_issue()}} to manage our 
> personal R package releases, and I'm wondering if creating a similar function 
> or markdown template would help here.
> I'm happy to start the process of putting a PR up for discussion!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15470) [R] Allow user to specify string to be used for missing data when writing CSV dataset

2022-11-18 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-15470:
--

Assignee: Will Jones

> [R] Allow user to specify string to be used for missing data when writing 
> CSV dataset
> --
>
> Key: ARROW-15470
> URL: https://issues.apache.org/jira/browse/ARROW-15470
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Will Jones
>Priority: Major
>
> The ability to select the string used for missing data was implemented for 
> the CSV writer in ARROW-14903 and, as David Li points out below, is already 
> available, so I think we just need to hook it up on the R side.
> This requires the values passed in as the "na" argument to instead be passed 
> through to "null_strings", similar to what was done with "skip" and 
> "skip_rows" in ARROW-15743.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18355) [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null

2022-11-18 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636012#comment-17636012
 ] 

Will Jones commented on ARROW-18355:


This feature is "soft-deprecated" in readr. Do we still want to add support?
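For reference, the corresponding option is already exposed in pyarrow; a minimal sketch of the equivalent there (the dataset path is hypothetical):

{code:python}
import pyarrow.csv as pacsv
import pyarrow.dataset as ds

# strings_can_be_null=True lets entries in null_values (e.g. "") become
# null even in string columns; this is the option quoted_na would map onto.
convert = pacsv.ConvertOptions(strings_can_be_null=True)
fmt = ds.CsvFileFormat(convert_options=convert)
dataset = ds.dataset("path/to/csv_dir", format=fmt)
{code}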

> [R] support the quoted_na argument in open_dataset for CSVs by mapping it to 
> CSVConvertOptions$strings_can_be_null
> --
>
> Key: ARROW-18355
> URL: https://issues.apache.org/jira/browse/ARROW-18355
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Will Jones
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18362) [Parquet][C++] Accelerate bit-packing decoding with AVX-512

2022-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18362:
---
Labels: pull-request-available  (was: )

> [Parquet][C++] Accelerate bit-packing decoding with AVX-512
> ---
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: zhaoyaqi
>Assignee: zhaoyaqi
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Accelerate Parquet bit-packing decoding with AVX-512 instructions?
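For context, a scalar reference of what bit-unpacking does; an illustrative Python sketch, not the Arrow implementation (AVX-512 kernels vectorize exactly this loop):

{code:python}
import numpy as np

def unpack_bits(packed: bytes, bit_width: int, count: int) -> list[int]:
    # Expand the buffer into individual bits, LSB-first within each byte,
    # matching Parquet's bit-packed layout, then reassemble each value.
    bits = np.unpackbits(np.frombuffer(packed, dtype=np.uint8),
                         bitorder="little")
    weights = 1 << np.arange(bit_width, dtype=np.int64)
    return [int(bits[i * bit_width:(i + 1) * bit_width] @ weights)
            for i in range(count)]
{code}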



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18362) [Parquet][C++] Accelerate bit-packing decoding with AVX-512

2022-11-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-18362:
-
Summary: [Parquet][C++] Accelerate bit-packing decoding with AVX-512  (was: 
[C++] Accelerate Parquet bit-packing decoding with AVX-512)

> [Parquet][C++] Accelerate bit-packing decoding with AVX-512
> ---
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: zhaoyaqi
>Assignee: zhaoyaqi
>Priority: Major
>
> Accelerate Parquet bit-packing decoding with AVX-512 instructions?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18362) [C++] Accelerate Parquet bit-packing decoding with AVX-512

2022-11-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-18362:


Assignee: zhaoyaqi

> [C++] Accelerate Parquet bit-packing decoding with AVX-512
> --
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: zhaoyaqi
>Assignee: zhaoyaqi
>Priority: Major
>
> Accelerate Parquet bit-packing decoding with AVX-512 instructions?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18362) [C++] Accelerate Parquet bit-packing decoding with AVX-512

2022-11-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-18362:
-
Component/s: Parquet

> [C++] Accelerate Parquet bit-packing decoding with AVX-512
> --
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: zhaoyaqi
>Assignee: zhaoyaqi
>Priority: Major
>
> Accelerate Parquet bit-packing decoding with AVX-512 instructions?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18355) [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null

2022-11-18 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-18355:
--

Assignee: Will Jones

> [R] support the quoted_na argument in open_dataset for CSVs by mapping it to 
> CSVConvertOptions$strings_can_be_null
> --
>
> Key: ARROW-18355
> URL: https://issues.apache.org/jira/browse/ARROW-18355
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Will Jones
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18365) [C++][Parquet] Optimize DELTA_BINARY_PACKED encoding and decoding

2022-11-18 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-18365:
--

 Summary: [C++][Parquet] Optimize DELTA_BINARY_PACKED encoding and 
decoding
 Key: ARROW-18365
 URL: https://issues.apache.org/jira/browse/ARROW-18365
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Parquet
Reporter: Rok Mihevc


[As suggested 
here|https://github.com/apache/arrow/pull/14191#discussion_r1019762308], a SIMD 
approach such as 
[FastDifferentialCoding|https://github.com/lemire/FastDifferentialCoding] could 
be used to speed up encoding and decoding.
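For context, decoding DELTA_BINARY_PACKED boils down to a prefix sum over the unpacked deltas; an illustrative scalar sketch (not the Arrow implementation) of the step such SIMD kernels accelerate:

{code:python}
import numpy as np

def delta_decode(first_value: int, deltas: np.ndarray) -> np.ndarray:
    # value[0] = first_value; value[i] = first_value + sum(deltas[:i])
    out = np.empty(len(deltas) + 1, dtype=np.int64)
    out[0] = first_value
    np.cumsum(deltas, out=out[1:])
    out[1:] += first_value
    return out
{code}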



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18323) MIGRATION TEST ISSUE #2

2022-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18323:
---
Labels: pull-request-available  (was: )

> MIGRATION TEST ISSUE #2
> ---
>
> Key: ARROW-18323
> URL: https://issues.apache.org/jira/browse/ARROW-18323
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Assignee: Todd Farmer
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue was created to help test migration-related process and tooling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18323) MIGRATION TEST ISSUE #2

2022-11-18 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-18323:
---

Assignee: Todd Farmer

> MIGRATION TEST ISSUE #2
> ---
>
> Key: ARROW-18323
> URL: https://issues.apache.org/jira/browse/ARROW-18323
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Assignee: Todd Farmer
>Priority: Major
>
> This issue was created to help test migration-related process and tooling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17457) [C++] Substrait End-To-End Tests for Relations

2022-11-18 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635971#comment-17635971
 ] 

Apache Arrow JIRA Bot commented on ARROW-17457:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++] Substrait End-To-End Tests for Relations
> --
>
> Key: ARROW-17457
> URL: https://issues.apache.org/jira/browse/ARROW-17457
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> At the moment the test coverage for Substrait integration covers the 
> functional tests for serializing and deserializing. But it lacks end-to-end 
> functional tests that prove whether a Substrait plan can deliver the 
> expected outcome. As a part of this, each relation (Read, Filter, Project, 
> Join, Aggregate) must have end-to-end tests covering the options associated 
> with it. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17457) [C++] Substrait End-To-End Tests for Relations

2022-11-18 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-17457:
-

Assignee: (was: Vibhatha Lakmal Abeykoon)

> [C++] Substrait End-To-End Tests for Relations
> --
>
> Key: ARROW-17457
> URL: https://issues.apache.org/jira/browse/ARROW-17457
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> At the moment the test coverage for Substrait integration covers the 
> functional tests for serializing and deserializing. But it lacks end-to-end 
> functional tests that prove whether a Substrait plan can deliver the 
> expected outcome. As a part of this, each relation (Read, Filter, Project, 
> Join, Aggregate) must have end-to-end tests covering the options associated 
> with it. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-8163) [C++][Dataset] Allow FileSystemDataset's file list to be lazy

2022-11-18 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-8163:


Assignee: (was: Pavel Solodovnikov)

> [C++][Dataset] Allow FileSystemDataset's file list to be lazy
> -
>
> Key: ARROW-8163
> URL: https://issues.apache.org/jira/browse/ARROW-8163
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: dataset
>
> A FileSystemDataset currently requires a full listing of files it contains on 
> construction, so a scan cannot start until all files in the dataset are 
> discovered. Instead it would be ideal if a large dataset could be constructed 
> with a lazy file listing so that scans can start immediately.
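From Python, the current eager behavior looks like this (the path is hypothetical):

{code:python}
import pyarrow.dataset as ds

# Discovery lists every file under the path before dataset() returns,
# so the first scan cannot begin until the (possibly slow) listing finishes.
dataset = ds.dataset("/data/very_large_dataset", format="parquet")
print(len(dataset.files))  # the full file list is already materialized

# The proposal: let FileSystemDataset consume a lazy listing so scans
# can start while discovery is still in progress.
{code}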



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-8163) [C++][Dataset] Allow FileSystemDataset's file list to be lazy

2022-11-18 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635972#comment-17635972
 ] 

Apache Arrow JIRA Bot commented on ARROW-8163:
--

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++][Dataset] Allow FileSystemDataset's file list to be lazy
> -
>
> Key: ARROW-8163
> URL: https://issues.apache.org/jira/browse/ARROW-8163
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Pavel Solodovnikov
>Priority: Major
>  Labels: dataset
>
> A FileSystemDataset currently requires a full listing of files it contains on 
> construction, so a scan cannot start until all files in the dataset are 
> discovered. Instead it would be ideal if a large dataset could be constructed 
> with a lazy file listing so that scans can start immediately.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18364) MIGRATION: Update GitHub issue templates to support bug reports and feature requests

2022-11-18 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-18364:
---

Assignee: Todd Farmer

> MIGRATION: Update GitHub issue templates to support bug reports and feature 
> requests
> 
>
> Key: ARROW-18364
> URL: https://issues.apache.org/jira/browse/ARROW-18364
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Assignee: Todd Farmer
>Priority: Major
>
> The [GitHub issue creation page for 
> Arrow|https://github.com/apache/arrow/issues/new/choose] directs users to 
> open bug reports in Jira. Now that ASF Infra has disabled self-service 
> registration in Jira, and in light of the pending migration of Apache Arrow 
> issue tracking from ASF Jira to GitHub issues, we should enable bug reports 
> to be submitted via GitHub directly. Issue templates will help distinguish 
> bug reports and feature requests from existing usage assistance questions.
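For illustration, a hypothetical issue form under .github/ISSUE_TEMPLATE/ might look like this (names, labels, and fields here are assumptions, not the final templates):

{code:yaml}
# .github/ISSUE_TEMPLATE/bug_report.yml (hypothetical)
name: Bug report
description: Report a problem with Apache Arrow
labels: ["Type: bug"]
body:
  - type: textarea
    attributes:
      label: Describe the bug
      description: Include error messages, the Arrow version, and your platform.
    validations:
      required: true
{code}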



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18364) MIGRATION: Update GitHub issue templates to support bug reports and feature requests

2022-11-18 Thread Todd Farmer (Jira)
Todd Farmer created ARROW-18364:
---

 Summary: MIGRATION: Update GitHub issue templates to support bug 
reports and feature requests
 Key: ARROW-18364
 URL: https://issues.apache.org/jira/browse/ARROW-18364
 Project: Apache Arrow
  Issue Type: Task
Reporter: Todd Farmer


The [GitHub issue creation page for 
Arrow|https://github.com/apache/arrow/issues/new/choose] directs users to open 
bug reports in Jira. Now that ASF Infra has disabled self-service registration 
in Jira, and in light of the pending migration of Apache Arrow issue tracking 
from ASF Jira to GitHub issues, we should enable bug reports to be submitted 
via GitHub directly. Issue templates will help distinguish bug reports and 
feature requests from existing usage assistance questions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18314) [R] "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes cpp11::unwind_exception, crashes R

2022-11-18 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635963#comment-17635963
 ] 

Nicola Crane commented on ARROW-18314:
--

Hmm, not sure what to suggest here, though I wonder if this has similar causes 
as ARROW-18313

> [R] "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes  
> CPP11::unwind_execption, crashed R
> --
>
> Key: ARROW-18314
> URL: https://issues.apache.org/jira/browse/ARROW-18314
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Lucas Mation
>Priority: Major
> Attachments: image-2022-11-11-14-55-36-430.png, 
> image-2022-11-11-14-59-30-132.png
>
>
> This is running on a Windows environment with arrow 10.0.0 (see arrow_info() 
> below). The data size is large.
> I issued two calls:
> ```
> ft <- path_to_dataset1
> fa <- path_to_dataset2
> # 1)
> tic()
> d2 <- ft %>% open_dataset %>% filter(pis %in% mypis) %>% collect
> toc()
> 927.11 sec elapsed
> # returned a dataset with 44 obs and 38 columns; took an abnormally long time (16 min)
> # 2)
> tic()
> d3 <- fa %>% open_dataset %>% filter(pis %in% mypis) %>% collect
> terminate called after throwing an instance of 'cpp11::unwind_exception'
> ```
> Then I got an error that crashpad_handler.exe stopped working, R became 
> frozen, and after a while R crashed too.
> !image-2022-11-11-14-59-30-132.png!
>  
> arrow_info()
> Arrow package version: 10.0.0
> Capabilities:
>                
> dataset    TRUE
> substrait FALSE
> parquet    TRUE
> json       TRUE
> s3         TRUE
> gcs        TRUE
> utf8proc   TRUE
> re2        TRUE
> snappy     TRUE
> gzip       TRUE
> brotli     TRUE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2        TRUE
> jemalloc  FALSE
> mimalloc   TRUE
> Arrow options():
>                        
> arrow.use_threads FALSE
> Memory:
>                   
> Allocator mimalloc
> Current    0 bytes
> Max        0 bytes
> Runtime:
>                         
> SIMD Level          avx2
> Detected SIMD Level avx2
> Build:
>                                                              
> C++ Library Version                                    10.0.0
> C++ Compiler                                              GNU
> C++ Compiler Version                                   10.3.0
> Git ID               aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18332) [Go] Casting Dictionary types to their value type

2022-11-18 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-18332.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14650
[https://github.com/apache/arrow/pull/14650]

> [Go] Casting Dictionary types to their value type
> -
>
> Key: ARROW-18332
> URL: https://issues.apache.org/jira/browse/ARROW-18332
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18110) [Go] Scalar Comparisons

2022-11-18 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol updated ARROW-18110:
--
Component/s: Go
 (was: GPU)

> [Go] Scalar Comparisons
> ---
>
> Key: ARROW-18110
> URL: https://issues.apache.org/jira/browse/ARROW-18110
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18363) [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)

2022-11-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635876#comment-17635876
 ] 

Joris Van den Bossche commented on ARROW-18363:
---

There is also some work to include this upstream in the sphinx theme: 
https://github.com/pydata/pydata-sphinx-theme/pull/780

> [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)
> -
>
> Key: ARROW-18363
> URL: https://issues.apache.org/jira/browse/ARROW-18363
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Now that we have versioned docs, we also have the old versions of the 
> developers docs (e.g. 
> https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
> might be outdated (e.g. regarding communication channels, build instructions, 
> etc.), and typically when contributing / developing with the latest arrow, one 
> should _always_ check the latest dev version of the contributing docs.
> We could add a warning box pointing this out and linking to the dev docs. 
> For example, similar to how some projects warn about viewing old docs in 
> general and point to the stable docs (e.g. https://mne.tools/1.1/index.html or 
> https://scikit-learn.org/1.0/user_guide.html), we could have a custom box 
> when on a page in /developers that points to the dev docs instead of the 
> stable docs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18363) [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)

2022-11-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635873#comment-17635873
 ] 

Joris Van den Bossche commented on ARROW-18363:
---

For example, the MNE docs mentioned above do this with a piece of JavaScript: 
https://github.com/mne-tools/mne-tools.github.io/blob/main/versionwarning.js

> [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)
> -
>
> Key: ARROW-18363
> URL: https://issues.apache.org/jira/browse/ARROW-18363
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Now that we have versioned docs, we also have the old versions of the 
> developers docs (e.g. 
> https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
> might be outdated (e.g. regarding communication channels, build instructions, 
> etc.), and typically when contributing / developing with the latest arrow, one 
> should _always_ check the latest dev version of the contributing docs.
> We could add a warning box pointing this out and linking to the dev docs. 
> For example, similar to how some projects warn about viewing old docs in 
> general and point to the stable docs (e.g. https://mne.tools/1.1/index.html or 
> https://scikit-learn.org/1.0/user_guide.html), we could have a custom box 
> when on a page in /developers that points to the dev docs instead of the 
> stable docs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18362) [C++] Accelerate Parquet bit-packing decoding with ICX AVX-512

2022-11-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18362:
---
Summary: [C++] Accelerate Parquet bit-packing decoding with ICX AVX-512  
(was: Accelerate Parquet bit-packing decoding with ICX AVX-512)

> [C++] Accelerate Parquet bit-packing decoding with ICX AVX-512
> --
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: zhaoyaqi
>Priority: Major
>
> h1. Accelerate Parquet bit-packing decoding with ICX AVX-512 instructions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18362) [C++] Accelerate Parquet bit-packing decoding with AVX-512

2022-11-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18362:
---
Summary: [C++] Accelerate Parquet bit-packing decoding with AVX-512  (was: 
[C++] Accelerate Parquet bit-packing decoding with ICX AVX-512)

> [C++] Accelerate Parquet bit-packing decoding with AVX-512
> --
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: zhaoyaqi
>Priority: Major
>
> h1. Accelerate Parquet bit-packing decoding with ICX AVX-512 instructions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18362) [C++] Accelerate Parquet bit-packing decoding with AVX-512

2022-11-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18362:
---
Description: Accelerate Parquet bit-packing decoding with AVX-512 
instructions?  (was: h1. Accelerate Parquet bit-packing decoding with ICX 
AVX-512 instructions)

> [C++] Accelerate Parquet bit-packing decoding with AVX-512
> --
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: zhaoyaqi
>Priority: Major
>
> Accelerate Parquet bit-packing decoding with AVX-512 instructions?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18363) [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)

2022-11-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635864#comment-17635864
 ] 

Joris Van den Bossche commented on ARROW-18363:
---

Renamed the issue to not be specific to the contributing docs, since we can 
also do this for all docs. I think it would still be nice if we could 
special-case pages in the /developers section, so that for those pages we can 
1) point to the dev docs instead of the stable docs, and 2) also show this 
warning for the stable version.

> [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)
> -
>
> Key: ARROW-18363
> URL: https://issues.apache.org/jira/browse/ARROW-18363
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Now that we have versioned docs, we also have the old versions of the 
> developers docs (e.g. 
> https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
> might be outdated (e.g. regarding communication channels, build instructions, 
> etc.), and typically when contributing / developing with the latest arrow, one 
> should _always_ check the latest dev version of the contributing docs.
> We could add a warning box pointing this out and linking to the dev docs. 
> For example, similar to how some projects warn about viewing old docs in 
> general and point to the stable docs (e.g. https://mne.tools/1.1/index.html or 
> https://scikit-learn.org/1.0/user_guide.html), we could have a custom box 
> when on a page in /developers that points to the dev docs instead of the 
> stable docs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18362) Accelerate Parquet bit-packing decoding with ICX AVX-512

2022-11-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635863#comment-17635863
 ] 

Antoine Pitrou commented on ARROW-18362:


Are you willing to contribute this?

> Accelerate Parquet bit-packing decoding with ICX AVX-512
> 
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: zhaoyaqi
>Priority: Major
>
> h1. Accelerate Parquet bit-packing decoding with ICX AVX-512 instructions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18363) [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)

2022-11-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18363:
--
Summary: [Docs] Include warning when viewing old docs (redirecting to 
stable/dev docs)  (was: [Docs] Include warning when viewing old contributing 
docs (redirecting to dev docs))

> [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)
> -
>
> Key: ARROW-18363
> URL: https://issues.apache.org/jira/browse/ARROW-18363
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Now that we have versioned docs, we also have the old versions of the 
> developers docs (e.g. 
> https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
> might be outdated (e.g. regarding communication channels, build instructions, 
> etc.), and typically when contributing / developing with the latest arrow, one 
> should _always_ check the latest dev version of the contributing docs.
> We could add a warning box pointing this out and linking to the dev docs. 
> For example, similar to how some projects warn about viewing old docs in 
> general and point to the stable docs (e.g. https://mne.tools/1.1/index.html or 
> https://scikit-learn.org/1.0/user_guide.html), we could have a custom box 
> when on a page in /developers that points to the dev docs instead of the 
> stable docs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18363) [Docs] Include warning when viewing old contributing docs (redirecting to dev docs)

2022-11-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18363:
-

 Summary: [Docs] Include warning when viewing old contributing docs 
(redirecting to dev docs)
 Key: ARROW-18363
 URL: https://issues.apache.org/jira/browse/ARROW-18363
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Joris Van den Bossche


Now that we have versioned docs, we also have the old versions of the 
developers docs (e.g. 
https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
might be outdated (e.g. regarding communication channels, build instructions, 
etc.), and typically when contributing / developing with the latest arrow, one 
should _always_ check the latest dev version of the contributing docs.

We could add a warning box pointing this out and linking to the dev docs. 

For example, similar to how some projects warn about viewing old docs in 
general and point to the stable docs (e.g. https://mne.tools/1.1/index.html or 
https://scikit-learn.org/1.0/user_guide.html), we could have a custom box when 
on a page in /developers that points to the dev docs instead of the stable 
docs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18298) [Python] datetime shifted when using pyarrow.Table.from_pandas to load a pandas DataFrame containing datetime with timezone

2022-11-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635819#comment-17635819
 ] 

Joris Van den Bossche commented on ARROW-18298:
---

bq. I thought initially it was just how it was presented, as going back to 
pandas in this example from the table gives the "correct" representation of the 
value:

Yes, in this case that is the cause of the confusion. The dates are not "wrong" 
after conversion to arrow; they are just confusingly printed in UTC without any 
indication of this. We have ARROW-14567 to track this issue.

bq. However, placing mixed timezones makes the behavior more apparent in that 
it is coercing to the first timezone.

That's a separate issue (and something that doesn't happen that often; for 
example, pandas also requires a single timezone for a column if you have a 
datetime64 dtype). But indeed, Arrow's timestamp type requires a single 
timezone, and thus when encountering multiple ones, we currently coerce to the 
first one. I think it would be better to coerce to UTC instead (-> ARROW-5912). 
There is some discussion about the use case of actually having multiple 
timezones in a single array at ARROW-16540.
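A small sketch of both points from Python, assuming recent pandas and pyarrow (the schema comments reflect the coerce-to-first-timezone behavior described above):

{code:python}
import pandas as pd
import pyarrow as pa

ts_la = pd.Timestamp("2022-10-21 22:46:17", tz="America/Los_Angeles")
ts_ny = pd.Timestamp("2022-10-21 22:46:17", tz="America/New_York")

# Mixed timezones force an object-dtype column in pandas.
df = pd.DataFrame({"TS": [ts_la, ts_ny]})
table = pa.Table.from_pandas(df)
print(table.schema)  # timestamp[ns, tz=America/Los_Angeles]: coerced to first tz

# Normalizing to UTC up front avoids the surprise entirely.
df["TS"] = pd.to_datetime(df["TS"], utc=True)
print(pa.Table.from_pandas(df).schema)  # timestamp[ns, tz=UTC]
{code}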



> [Python] datetime shifted when using pyarrow.Table.from_pandas to load a 
> pandas DataFrame containing datetime with timezone
> ---
>
> Key: ARROW-18298
> URL: https://issues.apache.org/jira/browse/ARROW-18298
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
> Environment: MacOS M1, Python 3.8.13
>Reporter: Adam Ling
>Priority: Major
>
> Problem:
> When using pyarrow.Table.from_pandas to load a pandas DataFrame which 
> contains a timestamp object with timezone information, the created Table 
> object will shift the datetime, while still keeping the timezone information. 
> Please see my scripts.
>  
> Reproduce scripts:
> {code:java}
> import pandas as pd
> import pyarrow
> ts = pd.Timestamp("2022-10-21 22:46:17", tz="America/Los_Angeles")
> df = pd.DataFrame({"TS": [ts]})
> table = pyarrow.Table.from_pandas(df)
> print(df)
> """
>  TS
> 0 2022-10-21 22:46:17-07:00
> """
> print(table)
> """
> pyarrow.Table
> TS: timestamp[ns, tz=America/Los_Angeles]
> 
> TS: [[2022-10-22 05:46:17.0]]""" {code}
> Expected results:
> The table should not shift the datetime when timezone information is provided.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18362) Accelerate Parquet bit-packing decoding with ICX AVX-512

2022-11-18 Thread zhaoyaqi (Jira)
zhaoyaqi created ARROW-18362:


 Summary: Accelerate Parquet bit-packing decoding with ICX AVX-512
 Key: ARROW-18362
 URL: https://issues.apache.org/jira/browse/ARROW-18362
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: zhaoyaqi


h1. Accelerate Parquet bit-packing decoding with ICX AVX-512 instructions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17136) [C++] HadoopFileSystem open_append_stream throwing an error if file does not exists

2022-11-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17136:
--
Summary: [C++] HadoopFileSystem open_append_stream throwing an error if 
file does not exists  (was: [C++] open_append_stream throwing an error if file 
does not exists)

> [C++] HadoopFileSystem open_append_stream throwing an error if file does not 
> exists
> ---
>
> Key: ARROW-17136
> URL: https://issues.apache.org/jira/browse/ARROW-17136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Sagar Shinde
>Priority: Minor
>
> As per the documentation, the open_append_stream method will create the file 
> if it does not exist. But when I try to append to a file in HDFS, it throws 
> a file-not-found error.
> hdfsOpenFile(/tmp/xyz.json): 
> FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;)
>  error:
> RemoteException: Failed to append to non-existent file /tmp/xyz.json for 
> client
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> java.io.FileNotFoundException: Failed to append to non-existent file 
> /tmp/xyz.json for client x.x.x.x
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>         at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1367)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1424)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1394)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:423)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:419)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(Fil
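A minimal workaround sketch from Python, assuming pyarrow's HadoopFileSystem (the connection parameters and path are illustrative): create the file explicitly before opening it for append.

{code:python}
import pyarrow.fs as pafs

# "default" picks up fs.defaultFS from the Hadoop configuration.
hdfs = pafs.HadoopFileSystem(host="default")

path = "/tmp/xyz.json"
# open_append_stream currently fails on HDFS when the file is missing,
# so create an empty file first.
if hdfs.get_file_info(path).type == pafs.FileType.NotFound:
    hdfs.open_output_stream(path).close()

with hdfs.open_append_stream(path) as out:
    out.write(b'{"key": "value"}\n')
{code}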

[jira] [Updated] (ARROW-17136) [C++] HadoopFileSystem open_append_stream throwing an error if file does not exists

2022-11-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17136:
--
Labels: good-first-issue  (was: )

> [C++] HadoopFileSystem open_append_stream throwing an error if file does not 
> exists
> ---
>
> Key: ARROW-17136
> URL: https://issues.apache.org/jira/browse/ARROW-17136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Sagar Shinde
>Priority: Minor
>  Labels: good-first-issue
>
> As per the documentation, the open_append_stream method will create the file 
> if it does not exist. But when I try to append to a file in HDFS, it throws 
> a file-not-found error.
> hdfsOpenFile(/tmp/xyz.json): 
> FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;)
>  error:
> RemoteException: Failed to append to non-existent file /tmp/xyz.json for 
> client
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> java.io.FileNotFoundException: Failed to append to non-existent file 
> /tmp/xyz.json for client x.x.x.x
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>         at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1367)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1424)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1394)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:423)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:419)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append

[jira] [Updated] (ARROW-17136) [C++] open_append_stream throwing an error if file does not exists

2022-11-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17136:
--
Component/s: C++
 (was: Python)

> [C++] open_append_stream throwing an error if file does not exists
> --
>
> Key: ARROW-17136
> URL: https://issues.apache.org/jira/browse/ARROW-17136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Sagar Shinde
>Priority: Minor
>
> As per the documentation, the open_append_stream method will create the file 
> if it does not exist. But when I try to append to a file in HDFS, it throws 
> a file-not-found error.
> hdfsOpenFile(/tmp/xyz.json): 
> FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;)
>  error:
> RemoteException: Failed to append to non-existent file /tmp/xyz.json for 
> client
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> java.io.FileNotFoundException: Failed to append to non-existent file 
> /tmp/xyz.json for client x.x.x.x
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>         at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1367)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1424)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1394)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:423)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:419)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:431)
>         at 
> org.apache

[jira] [Updated] (ARROW-17136) [C++] open_append_stream throwing an error if file does not exists

2022-11-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17136:
--
Summary: [C++] open_append_stream throwing an error if file does not exists 
 (was: open_append_stream throwing an error if file does not exists)

> [C++] open_append_stream throwing an error if file does not exists
> --
>
> Key: ARROW-17136
> URL: https://issues.apache.org/jira/browse/ARROW-17136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Sagar Shinde
>Priority: Minor
>
> As per the documentation, the open_append_stream method will create the file 
> if it does not exist. But when I try to append to a file in HDFS, it throws 
> a file-not-found error.
> hdfsOpenFile(/tmp/xyz.json): FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;) error:
> RemoteException: Failed to append to non-existent file /tmp/xyz.json for client
>         at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> java.io.FileNotFoundException: Failed to append to non-existent file /tmp/xyz.json for client x.x.x.x
>         at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>         at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1367)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1424)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1394)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:423)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:419)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.ha

[jira] [Comment Edited] (ARROW-18276) [Python] Reading from hdfs using pyarrow 10.0.0 throws OSError: [Errno 22] Opening HDFS file

2022-11-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635303#comment-17635303
 ] 

Joris Van den Bossche edited comment on ARROW-18276 at 11/18/22 9:43 AM:
-

Hi [~moritzmeister] !

Could you try using {{pyarrow}} directly, to see if you then get the same error 
when opening the file? 
You can instantiate a {{HadoopFileSystem}} object [from a URI 
string|https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html#pyarrow.fs.HadoopFileSystem.from_uri], 
or by using the class constructor directly 
(https://arrow.apache.org/docs/dev/python/filesystems.html#hadoop-distributed-file-system-hdfs).
Something like this:

{code:python}
from pyarrow import fs
# note: HadoopFileSystem.from_uri returns the filesystem itself, not a
# (filesystem, path) tuple like the generic FileSystem.from_uri
hdfs = fs.HadoopFileSystem.from_uri('hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/')
hdfs.open_input_file("/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/part-0-42b57ad2-57eb-4a63-bfaa-7375e82863e8-c000.csv")
{code}
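For the constructor form mentioned above, a minimal sketch (reusing the host 
and port from the URI; adjust to your cluster) would be:

{code:python}
from pyarrow import fs

# same namenode host and port as in the hdfs:// URI above
hdfs = fs.HadoopFileSystem("10.0.2.15", 8020)
{code}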

If that works, you can then use {{hdfs}} with {{fsspec}}:
[https://arrow.apache.org/docs/python/filesystems.html#using-arrow-filesystems-with-fsspec]

and {{fsspec}} API to open the files:
[https://filesystem-spec.readthedocs.io/en/latest/api.html]

Something similar to this:
{code:python}
from pyarrow import fs
hdfs = fs.HadoopFileSystem.from_uri('hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/')
from fsspec.implementations.arrow import ArrowFSWrapper
hdfs_fsspec = ArrowFSWrapper(hdfs)
hdfs_fsspec.open_files(...)
{code}
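As a follow-up sketch (the exact read is not spelled out above), the wrapped 
filesystem could then feed the pandas call from the report:

{code:python}
import pandas as pd

# assumes hdfs_fsspec from the previous block
path = ("/Projects/testing/testing_Training_Datasets/"
        "transactions_view_fraud_batch_fv_1_1/validation/"
        "part-0-42b57ad2-57eb-4a63-bfaa-7375e82863e8-c000.csv")
with hdfs_fsspec.open(path, "rb") as f:
    df = pd.read_csv(f)
{code}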
This way you can see whether pyarrow 10.0.0 itself works or errors. It is also 
more direct, so less likely to error :)

Also, do you maybe know if the Hadoop installation has changed in this time?


was (Author: alenkaf):
Hi [~moritzmeister] !

Could you try using {{pyarrow}} directly?
You can instantiate a {{HadoopFileSystem}} object [from a URI 
string|https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html#pyarrow.fs.HadoopFileSystem.from_uri].

If that works, you can then use {{hdfs}} with {{fsspec}}:
[https://arrow.apache.org/docs/python/filesystems.html#using-arrow-filesystems-with-fsspec]

and {{fsspec}} API to open the files:
[https://filesystem-spec.readthedocs.io/en/latest/api.html]

Something similar to this:
{code:python}
from pyarrow import fs
hdfs = fs.HadoopFileSystem.from_uri('hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/')
from fsspec.implementations.arrow import ArrowFSWrapper
hdfs_fsspec = ArrowFSWrapper(hdfs)
hdfs_fsspec.open_files(...)
{code}
This way you can see whether pyarrow 10.0.0 itself works or errors. It is also 
more direct, so less likely to error :)

Also, do you maybe know if the Hadoop installation has changed in this time?

> [Python] Reading from hdfs using pyarrow 10.0.0 throws OSError: [Errno 22] 
> Opening HDFS file
> 
>
> Key: ARROW-18276
> URL: https://issues.apache.org/jira/browse/ARROW-18276
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.0
> Environment: pyarrow 10.0.0
> fsspec 2022.7.1
> pandas 1.3.3
> python 3.8.11.
>Reporter: Moritz Meister
>Priority: Major
>
> Hey!
> I am trying to read a CSV file from HDFS using pyarrow together with fsspec.
> I used to do this with pyarrow 9.0.0 and fsspec 2022.7.1; however, after I 
> upgraded to pyarrow 10.0.0, this stopped working.
> I am not quite sure whether this is an incompatibility introduced in the new 
> pyarrow version or a bug in fsspec, so if I am in the wrong place here, 
> please let me know.
> Apart from pyarrow 10.0.0 and fsspec 2022.7.1, I am using pandas version 
> 1.3.3 and python 3.8.11.
> Here is the full stack trace
> {code:python}
> pd.read_csv("hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/part-0-42b57ad2-57eb-4a63-bfaa-7375e82863e8-c000.csv")
> ---
> OSError                                   Traceback (most recent call last)
> /srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, comp