[jira] [Commented] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source

2022-06-10 Thread Earle Lyons (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553023#comment-17553023
 ] 

Earle Lyons commented on ARROW-16810:
-

Thanks so much for the response, [~yibocai]! 

I sincerely appreciate the response and information. Your response makes sense 
and seems to align with the behavior. 

I suppose I could add code to remove any parquet files before 
pyarrow.dataset(path, format=custom_csv_format).

However, an argument to read only specific file types would be very helpful.

For example:
{code:python}
# pyarrow.dataset(path, format=custom_csv_format, filetype='csv')
# pyarrow.dataset(path, format=custom_csv_format, fileext='.csv')
# pyarrow.dataset(path/*.csv, format=custom_csv_format)
{code}
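In the meantime, a workaround sketch that avoids the problem entirely: list the 
CSV files explicitly and pass that list to {{pyarrow.dataset.dataset}}, which 
accepts a list of paths as well as a directory (the {{csv_path}} value is from 
the issue description; substitute the custom CSV format for {{format='csv'}} as 
needed):

{code:python}
import glob
import os

import pyarrow.dataset as ds

csv_path = '/home/user/csv_files'

# Collect only the '*.csv' files, skipping any 'part-0.parquet'
# that an earlier run wrote into the same directory.
csv_files = sorted(glob.glob(os.path.join(csv_path, '*.csv')))

# ds.dataset() also accepts an explicit list of file paths.
dataset = ds.dataset(csv_files, format='csv')
{code}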

Per your comment I am CCing [~jorisvandenbossche].

Thanks again! :) 

> [Python] PyArrow: write_dataset - Could not open CSV input source
> -
>
> Key: ARROW-16810
> URL: https://issues.apache.org/jira/browse/ARROW-16810
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
> Environment: Pop!_OS 20.04 LTS OS & Conda 4.11.0 /Mamba 0.23.0 
> Environment
>Reporter: Earle Lyons
>Priority: Minor

[jira] [Commented] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source

2022-06-10 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553021#comment-17553021
 ] 

Yibo Cai commented on ARROW-16810:
--

I think pyarrow.dataset.dataset(dir, format='xxx') will read all files under 
the dir and try to parse them as format 'xxx'.
cc [~jorisvandenbossche] for comments.
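
For reference, dataset discovery does have an {{exclude_invalid_files}} option 
that asks the factory to open each file and drop the ones the chosen format 
cannot read, at the cost of extra I/O up front. A minimal sketch, using the 
reporter's directory:

{code:python}
import pyarrow.dataset as ds

# With exclude_invalid_files=True, files that are not valid CSV
# (e.g. a stray part-0.parquet) are skipped during discovery.
# Note that this inspects every file, which can be slow on
# remote filesystems such as S3.
dataset = ds.dataset('/home/user/csv_files', format='csv',
                     exclude_invalid_files=True)
{code}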

> [Python] PyArrow: write_dataset - Could not open CSV input source
> -
>
> Key: ARROW-16810
> URL: https://issues.apache.org/jira/browse/ARROW-16810
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
> Environment: Pop!_OS 20.04 LTS OS & Conda 4.11.0 /Mamba 0.23.0 
> Environment
>Reporter: Earle Lyons
>Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16813) [Go][Parquet] Disabling dictionary encoding per column in writer config broken

2022-06-10 Thread Matt DePero (Jira)
Matt DePero created ARROW-16813:
---

 Summary: [Go][Parquet] Disabling dictionary encoding per column in 
writer config broken
 Key: ARROW-16813
 URL: https://issues.apache.org/jira/browse/ARROW-16813
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Matt DePero


Small bug in how we set the column-level dictionary encoding config: it is 
always set to true rather than respecting the input value.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16813) [Go][Parquet] Disabling dictionary encoding per column in writer config broken

2022-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16813:
---
Labels: pull-request-available  (was: )

> [Go][Parquet] Disabling dictionary encoding per column in writer config broken
> --
>
> Key: ARROW-16813
> URL: https://issues.apache.org/jira/browse/ARROW-16813
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Matt DePero
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Small bug in how we set column level dictionary encoding config, always set 
> to true rather than respecting input value.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16811) [C++] Remove default exec context from Expression::Bind

2022-06-10 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552993#comment-17552993
 ] 

Yaron Gvili commented on ARROW-16811:
-

You are referring to this post. This limited bind is not ideal, though it can 
be useful as an intermediate solution in places in the code that cannot easily 
be changed to work with a non-default ExecContext. I imagine this could be the 
case in some user-facing APIs that currently do not take an ExecContext and 
eventually default to the global function registry (perhaps examples exist in 
the dataset package?). In such cases, there are two options to consider: either 
break user code to force it to provide an ExecContext, or keep user code intact 
but fail at runtime when an expression gets bound in an unsafe way. The latter 
is what I wanted to draw attention to.

> [C++] Remove default exec context from Expression::Bind
> ---
>
> Key: ARROW-16811
> URL: https://issues.apache.org/jira/browse/ARROW-16811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16812) "Edit this page" on docstring generated docs gives 404

2022-06-10 Thread Saul Pwanson (Jira)
Saul Pwanson created ARROW-16812:


 Summary: "Edit this page" on docstring generated docs gives 404
 Key: ARROW-16812
 URL: https://issues.apache.org/jira/browse/ARROW-16812
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation
Reporter: Saul Pwanson


Clicking on "Edit this page" on e.g. 
[https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_file.html] goes 
to a non-existent page.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-10 Thread Arkadiy Vertleyb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552978#comment-17552978
 ] 

Arkadiy Vertleyb commented on ARROW-16778:
--

[~willjones127] Can you try the following under your 32 bit architecture (where 
tests pass)?

1) break in the TEST_F(TestSetBitRunReader, OneByte)

2) put breakpoints on:
  - SkipNextZeroes
  - CountNextOnes

3) see what is going on.

In my system:
  current_word_ = 182 (1011 0110)
  num_zeros = 0
  current_word_ remains 182 -- no zeros removed

Then the following asserts because 182 & 1 == 0:

  int64_t CountNextOnes() {
assert(current_word_ & kFirstBit);

I have a feeling something may be wrong with the byte ordering, but I am not 
sure.


> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major
>
> When specifying Win32 as a platform, and building with MSVC, the build fails 
> with the following compile errors :
> {noformat}
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error 
> C3861: '__popcnt64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error 
> C3861: '_BitScanReverse64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error 
> C3861: '_BitScanForward64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source

2022-06-10 Thread Earle Lyons (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earle Lyons updated ARROW-16810:

Summary: [Python] PyArrow: write_dataset - Could not open CSV input source  
(was: PyArrow: write_dataset - Could not open CSV input source)

> [Python] PyArrow: write_dataset - Could not open CSV input source
> -
>
> Key: ARROW-16810
> URL: https://issues.apache.org/jira/browse/ARROW-16810
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
> Environment: Pop!_OS 20.04 LTS OS & Conda 4.11.0 /Mamba 0.23.0 
> Environment
>Reporter: Earle Lyons
>Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16810) PyArrow: write_dataset - Could not open CSV input source

2022-06-10 Thread Earle Lyons (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earle Lyons updated ARROW-16810:

Summary: PyArrow: write_dataset - Could not open CSV input source  (was: 
[Python] PyArrow: write_dataset - Could not open CSV input source)

> PyArrow: write_dataset - Could not open CSV input source
> 
>
> Key: ARROW-16810
> URL: https://issues.apache.org/jira/browse/ARROW-16810
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
> Environment: Pop!_OS 20.04 LTS OS & Conda 4.11.0 /Mamba 0.23.0 
> Environment
>Reporter: Earle Lyons
>Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source

2022-06-10 Thread Earle Lyons (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earle Lyons updated ARROW-16810:

Summary: [Python] PyArrow: write_dataset - Could not open CSV input source  
(was: PyArrow: write_dataset - Could not open CSV input source)

> [Python] PyArrow: write_dataset - Could not open CSV input source
> -
>
> Key: ARROW-16810
> URL: https://issues.apache.org/jira/browse/ARROW-16810
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
> Environment: Pop!_OS 20.04 LTS OS & Conda 4.11.0 /Mamba 0.23.0 
> Environment
>Reporter: Earle Lyons
>Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16811) [C++] Remove default exec context from Expression::Bind

2022-06-10 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552965#comment-17552965
 ] 

Weston Pace commented on ARROW-16811:
-

[~rtpsw] Where would you see the "bind but don't support functions" variant 
being useful?  I suppose I'm not quite sure I understand the intent.

> [C++] Remove default exec context from Expression::Bind
> ---
>
> Key: ARROW-16811
> URL: https://issues.apache.org/jira/browse/ARROW-16811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16811) [C++] Remove default exec context from Expression::Bind

2022-06-10 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace reassigned ARROW-16811:
---

Assignee: Weston Pace

> [C++] Remove default exec context from Expression::Bind
> ---
>
> Key: ARROW-16811
> URL: https://issues.apache.org/jira/browse/ARROW-16811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16811) [C++] Remove default exec context from Expression::Bind

2022-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16811:
---
Labels: pull-request-available  (was: )

> [C++] Remove default exec context from Expression::Bind
> ---
>
> Key: ARROW-16811
> URL: https://issues.apache.org/jira/browse/ARROW-16811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16810) PyArrow: write_dataset - Could not open CSV input source

2022-06-10 Thread Earle Lyons (Jira)
Earle Lyons created ARROW-16810:
---

 Summary: PyArrow: write_dataset - Could not open CSV input source
 Key: ARROW-16810
 URL: https://issues.apache.org/jira/browse/ARROW-16810
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 8.0.0
 Environment: Pop!_OS 20.04 LTS OS & Conda 4.11.0 /Mamba 0.23.0 
Environment
Reporter: Earle Lyons


Hi Arrow Community! 

Happy Friday! I am a new user to Arrow, specifically using pyarrow. However, I 
am very excited about the project. 

I am experiencing issues with the {*}write_dataset{*} function from the 
{*}dataset{*} module. Please forgive me if this is a known issue; however, I 
have searched the GitHub 'Issues' as well as Stack Overflow and have not 
identified a similar issue. 

I have a directory that contains 90 CSV files (essentially one CSV for each day 
between 2021-01-01 and 2021-03-31). My objective was to read all the CSV files 
into a dataset and write the dataset to a single Parquet file. Unfortunately, 
some of the CSV files contained nulls in some columns, which presented some 
issues; these were resolved by specifying DataTypes with the following Stack 
Overflow solution:

[How do I specify a dtype for all columns when reading a CSV file with 
pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow]

The following code works on the first pass.
{code:python}
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
import re
{code}
{code:python}
pa.__version__
'8.0.0'
{code}
{code:python}
column_types = {}
csv_path = '/home/user/csv_files'
field_re_pattern = "value_*"

# Open a dataset with the 'csv_path' path and 'csv' file format
# and assign to 'dataset1'
dataset1 = ds.dataset(csv_path, format='csv')

# Loop through each field in the 'dataset1' schema,
# match the 'field_re_pattern' regex pattern in the field name,
# and assign 'int64' DataType to the field.name in the 'column_types'
# dictionary 
for field in (field for field in dataset1.schema \
              if re.match(field_re_pattern, field.name)):
        column_types[field.name] = pa.int64()

# Create options for CSV data using the 'column_types' dictionary
# This returns a pyarrow._csv.ConvertOptions object
convert_options = csv.ConvertOptions(column_types=column_types)

# Create a FileFormat for CSV using the 'convert_options'
# This returns a pyarrow._dataset.CsvFileFormat object
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)

# Open a dataset with the 'csv_path' path; instead of the plain
# 'csv' file format, use the 'custom_csv_format' and assign to
# 'dataset2'
dataset2 = ds.dataset(csv_path, format=custom_csv_format)

# Write the 'dataset2' to the 'csv_path' base directory in the 
# 'parquet' format, and overwrite/ignore if the file exists
ds.write_dataset(dataset2, base_dir=csv_path, format='parquet', 
existing_data_behavior='overwrite_or_ignore')
{code}
As previously stated, on first pass, the code works and creates a single 
parquet file (part-0.parquet) with the correct data, row count, and schema.

However, if the code is run again, the following error is encountered:
{code:python}
ArrowInvalid: Could not open CSV input source 
'/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2: 
Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V   p   A$A18CEBS
305DEM030TTW �5HZ50GCVJV1CSV
{code}
My interpretation of the error is that on the second pass the 'dataset2' 
variable now includes the 'part-0.parquet' file (which can be confirmed with 
the `dataset2.files` output showing the file) and the CSV reader is attempting 
to parse/read the parquet file.
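
A workaround sketch, assuming the Parquet output does not need to live beside 
the CSV inputs: write it to a separate directory, so a second pass over 
'csv_path' never discovers it.

{code:python}
import pyarrow.dataset as ds

csv_path = '/home/user/csv_files'
out_path = '/home/user/parquet_out'  # hypothetical output directory

dataset2 = ds.dataset(csv_path, format='csv')

# Writing outside 'csv_path' keeps part-0.parquet out of any later
# ds.dataset(csv_path, format='csv') discovery.
ds.write_dataset(dataset2, base_dir=out_path, format='parquet',
                 existing_data_behavior='overwrite_or_ignore')
{code}

Alternatively, dataset discovery skips basenames starting with a dot or 
underscore ({{ignore_prefixes}} defaults to {{['.', '_']}}), so writing with, 
for example, {{basename_template='_part-{i}.parquet'}} should also keep the 
output invisible to later CSV discovery.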

If this is the case, is there an argument to ignore the parquet file and only 
evaluate the CSV files? Also, if a dataset object has a format of 'csv' or 
'pyarrow._dataset.CsvFileFormat' associated with it, it would be nice to 
evaluate only CSV files and not all file types in the path, if that is not the 
current behavior.

If this is not the case, any ideas on the cause or solution?

Any assistance would be greatly appreciated.

Thank you and have a great day!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16811) [C++] Remove default exec context from Expression::Bind

2022-06-10 Thread Weston Pace (Jira)
Weston Pace created ARROW-16811:
---

 Summary: [C++] Remove default exec context from Expression::Bind
 Key: ARROW-16811
 URL: https://issues.apache.org/jira/browse/ARROW-16811
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


This came up in https://github.com/apache/arrow/pull/13355.

It is maybe not very intuitive that Expression::Bind would require an 
ExecContext, and so we never provided one.  However, when binding expressions we 
need to look up kernels, and that requires a function registry.  Defaulting to 
default_exec_context is something that should be done at a higher level, and so 
we should not allow ExecContext to be omitted when calling Bind.

Furthermore, [~rtpsw] has suggested that we might want to split 
Expression::Bind into two variants.  One which requires an ExecContext and one 
which does not (but fails if it encounters a "call").



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument

2022-06-10 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-16796.
-
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13355
[https://github.com/apache/arrow/pull/13355]

> [C++] Fix bad defaulting of ExecContext argument
> 
>
> Key: ARROW-16796
> URL: https://issues.apache.org/jira/browse/ARROW-16796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In several places in Arrow code, invocations of Expression::Bind() default 
> the ExecContext argument. This leads to the default function registry being 
> used in expression manipulations, and this becomes a problem when the user 
> wishes to use a non-default function registry, e.g., when passing one to the 
> ExecContext of an ExecPlan, which is how I discovered this issue. The 
> problematic places I found for such Expression::Bind() invocation are:
>  * cpp/src/arrow/dataset/file_parquet.cc
>  * cpp/src/arrow/dataset/scanner.cc
>  * cpp/src/arrow/compute/exec/project_node.cc
>  * cpp/src/arrow/compute/exec/hash_join_node.cc
>  * cpp/src/arrow/compute/exec/filter_node.cc
> There are also other places in test and benchmark code (grep for 'Bind()').
> Another case of bad defaulting of an ExecContext argument is in 
> Inequality::simplifies_to in cpp/src/arrow/compute/exec/expression.cc, where a 
> fresh ExecContext is created, instead of being received from the caller, and 
> passed to BindNonRecursive.
> I'd argue that an ExecContext variable should not be allowed to default, 
> except perhaps in the highest-level/user-facing APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-10 Thread Arkadiy Vertleyb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552931#comment-17552931
 ] 

Arkadiy Vertleyb commented on ARROW-16778:
--

Okay, I will see what could be wrong with it.

> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-10 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552923#comment-17552923
 ] 

Will Jones commented on ARROW-16778:


[~avertleyb] We actually do test 32-bit on MinGW in CI on every PR, just not on 
MSVC.

It's likely there's just something wrong with the bit utility still; validity 
bitmaps are a fundamental part of Arrow Arrays, so it wouldn't be surprising at 
all that a single small issue in bitmap handling would break most tests.

> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16746) [C++][Python] S3 tag support on write

2022-06-10 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552922#comment-17552922
 ] 

Steve Loughran commented on ARROW-16746:


Yes. We use them a bit in the S3A committers, to annotate a zero-byte marker 
file with the length it will finally get when manifested at its destination. 
In HADOOP-17833 that's being exposed in the createFile(path) builder API, where 
apps can set headers at create time. Presumably GCS and Azure could be wired up 
differently; they both have the advantage that you can edit file attributes 
after creation.

> [C++][Python] S3 tag support on write
> -
>
> Key: ARROW-16746
> URL: https://issues.apache.org/jira/browse/ARROW-16746
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: André Kelpe
>Priority: Major
>
> S3 allows tagging data to better organize one's data
> ([https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html]).
> We use this for efficient downstream processes/inventory management.
> Currently arrow/pyarrow does not allow tags to be added on write. This is
> causing us to scan the bucket and re-apply the tags after a pyarrow-based
> process has run.
> I looked through the code and think that it could potentially be done via the 
> metadata mechanism.
> The tags need to be added to the CreateMultipartUploadRequest here: 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L1156
> See also
> http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_s3_1_1_model_1_1_create_multipart_upload_request.html#af791f34a65dc69bd681d6995313be2da
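
A sketch of the stop-gap described above (re-applying tags after the pyarrow 
write), using boto3; the bucket, key, and tag values are hypothetical:

{code:python}
import boto3

s3 = boto3.client('s3')

# Hypothetical object written by a pyarrow-based process.
bucket = 'my-bucket'
key = 'warehouse/part-0.parquet'

# Re-apply tags after the write, since pyarrow cannot set them yet.
s3.put_object_tagging(
    Bucket=bucket,
    Key=key,
    Tagging={'TagSet': [{'Key': 'team', 'Value': 'data-eng'}]},
)
{code}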



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups

2022-06-10 Thread Edward Visel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Visel updated ARROW-16807:
-
Description: 
When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, 
results are correct.

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

pa.compute.count_distinct(starwars.column('sex')).as_py()
#> 15
pa.compute.unique(starwars.column('sex'))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem as in this StackOverflow question, 
which is working from ORC files: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
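
Until the aggregation is fixed, a workaround sketch consistent with the output 
above, where {{pa.compute.unique}} already returns the correct values: count 
the unique values directly (the parquet path here is hypothetical).

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical path; any parquet file with multiple row groups
# reproduces the inconsistent count_distinct results.
starwars = pq.read_table('starwars.parquet')

# unique() sidesteps the count_distinct aggregation; note it counts
# null as a distinct value, matching R's n_distinct().
n_distinct = len(pa.compute.unique(starwars.column('sex')))
print(n_distinct)
#> 5
{code}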



> [C++] count_distinct aggregates incorrectly across row groups
> -
>
> Key: ARROW-16807
> URL: https://issues.apache.org/jira/browse/ARROW-16807
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: > 

[jira] [Comment Edited] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-10 Thread Arkadiy Vertleyb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552918#comment-17552918
 ] 

Arkadiy Vertleyb edited comment on ARROW-16778 at 6/10/22 6:41 PM:
---

[~willjones127] Depends.
Besides crash, multiple tests fail.
Let me ask you - in your best estimate, when was the last time someone ran 32 
bit tests?  It could be badly broken by now...
I am afraid fixing that might be a major task, even for someone closely 
familiar with the system, let alone a new library user like myself.
For your reference:
  Start  1: arrow-array-test
 1/26 Test  #1: arrow-array-test .................***Failed    0.24 sec
  Start  2: arrow-buffer-test
 2/26 Test  #2: arrow-buffer-test ................   Passed    0.03 sec
  Start  3: arrow-extension-type-test
 3/26 Test  #3: arrow-extension-type-test ........***Failed    0.02 sec
  Start  4: arrow-misc-test
 4/26 Test  #4: arrow-misc-test ..................   Passed    0.05 sec
  Start  5: arrow-public-api-test
 5/26 Test  #5: arrow-public-api-test ............   Passed    0.02 sec
  Start  6: arrow-scalar-test
 6/26 Test  #6: arrow-scalar-test ................***Failed    0.06 sec
  Start  7: arrow-type-test
 7/26 Test  #7: arrow-type-test ..................   Passed    0.15 sec
  Start  8: arrow-table-test
 8/26 Test  #8: arrow-table-test .................***Failed    0.05 sec
  Start  9: arrow-tensor-test
 9/26 Test  #9: arrow-tensor-test ................   Passed    0.02 sec
  Start 10: arrow-sparse-tensor-test
10/26 Test #10: arrow-sparse-tensor-test .........   Passed    0.07 sec
  Start 11: arrow-stl-test
11/26 Test #11: arrow-stl-test ...................   Passed    0.03 sec
  Start 12: arrow-random-test
12/26 Test #12: arrow-random-test ................   Passed    0.20 sec
  Start 13: arrow-json-integration-test
13/26 Test #13: arrow-json-integration-test ......***Failed    0.15 sec
  Start 14: arrow-concatenate-test
14/26 Test #14: arrow-concatenate-test ...........***Failed    0.02 sec
  Start 15: arrow-c-bridge-test
15/26 Test #15: arrow-c-bridge-test ..............***Failed    0.06 sec
  Start 16: arrow-io-buffered-test
16/26 Test #16: arrow-io-buffered-test ...........   Passed    0.08 sec
  Start 17: arrow-io-compressed-test
17/26 Test #17: arrow-io-compressed-test .........   Passed    0.02 sec
  Start 18: arrow-io-file-test
18/26 Test #18: arrow-io-file-test ...............***Failed   10.61 sec
  Start 19: arrow-io-memory-test
19/26 Test #19: arrow-io-memory-test .............***Failed    1.62 sec
  Start 20: arrow-utility-test
20/26 Test #20: arrow-utility-test ...............***Failed    3.00 sec
  Start 21: arrow-threading-utility-test
21/26 Test #21: arrow-threading-utility-test .....   Passed   39.77 sec
  Start 22: arrow-feather-test
22/26 Test #22: arrow-feather-test ...............***Failed    0.04 sec
  Start 23: arrow-ipc-json-simple-test
23/26 Test #23: arrow-ipc-json-simple-test .......***Failed    0.06 sec
  Start 24: arrow-ipc-read-write-test
24/26 Test #24: arrow-ipc-read-write-test ........***Failed    8.17 sec
  Start 25: arrow-ipc-tensor-test
25/26 Test #25: arrow-ipc-tensor-test ............***Failed    1.16 sec
  Start 26: arrow-json-test
26/26 Test #26: arrow-json-test ..................***Failed    0.03 sec

42% tests passed, 15 tests failed out of 26 



[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-10 Thread Arkadiy Vertleyb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552918#comment-17552918
 ] 

Arkadiy Vertleyb commented on ARROW-16778:
--

[~willjones127] Depends.
Besides crash, multiple tests fail.
Let me ask you - in your best estimate, when was the last time someone ran 32 
bit tests?  It could be badly broken by now...
I am afraid fixing that might be a major task, even for someone closely 
familiar with the system, let alone a library user like myself.
For your reference:
  Start  1: arrow-array-test
 1/26 Test  #1: arrow-array-test .................***Failed    0.24 sec
  Start  2: arrow-buffer-test
 2/26 Test  #2: arrow-buffer-test ................   Passed    0.03 sec
  Start  3: arrow-extension-type-test
 3/26 Test  #3: arrow-extension-type-test ........***Failed    0.02 sec
  Start  4: arrow-misc-test
 4/26 Test  #4: arrow-misc-test ..................   Passed    0.05 sec
  Start  5: arrow-public-api-test
 5/26 Test  #5: arrow-public-api-test ............   Passed    0.02 sec
  Start  6: arrow-scalar-test
 6/26 Test  #6: arrow-scalar-test ................***Failed    0.06 sec
  Start  7: arrow-type-test
 7/26 Test  #7: arrow-type-test ..................   Passed    0.15 sec
  Start  8: arrow-table-test
 8/26 Test  #8: arrow-table-test .................***Failed    0.05 sec
  Start  9: arrow-tensor-test
 9/26 Test  #9: arrow-tensor-test ................   Passed    0.02 sec
  Start 10: arrow-sparse-tensor-test
10/26 Test #10: arrow-sparse-tensor-test .........   Passed    0.07 sec
  Start 11: arrow-stl-test
11/26 Test #11: arrow-stl-test ...................   Passed    0.03 sec
  Start 12: arrow-random-test
12/26 Test #12: arrow-random-test ................   Passed    0.20 sec
  Start 13: arrow-json-integration-test
13/26 Test #13: arrow-json-integration-test ......***Failed    0.15 sec
  Start 14: arrow-concatenate-test
14/26 Test #14: arrow-concatenate-test ...........***Failed    0.02 sec
  Start 15: arrow-c-bridge-test
15/26 Test #15: arrow-c-bridge-test ..............***Failed    0.06 sec
  Start 16: arrow-io-buffered-test
16/26 Test #16: arrow-io-buffered-test ...........   Passed    0.08 sec
  Start 17: arrow-io-compressed-test
17/26 Test #17: arrow-io-compressed-test .........   Passed    0.02 sec
  Start 18: arrow-io-file-test
18/26 Test #18: arrow-io-file-test ...............***Failed   10.61 sec
  Start 19: arrow-io-memory-test
19/26 Test #19: arrow-io-memory-test .............***Failed    1.62 sec
  Start 20: arrow-utility-test
20/26 Test #20: arrow-utility-test ...............***Failed    3.00 sec
  Start 21: arrow-threading-utility-test
21/26 Test #21: arrow-threading-utility-test .....   Passed   39.77 sec
  Start 22: arrow-feather-test
22/26 Test #22: arrow-feather-test ...............***Failed    0.04 sec
  Start 23: arrow-ipc-json-simple-test
23/26 Test #23: arrow-ipc-json-simple-test .......***Failed    0.06 sec
  Start 24: arrow-ipc-read-write-test
24/26 Test #24: arrow-ipc-read-write-test ........***Failed    8.17 sec
  Start 25: arrow-ipc-tensor-test
25/26 Test #25: arrow-ipc-tensor-test ............***Failed    1.16 sec
  Start 26: arrow-json-test
26/26 Test #26: arrow-json-test ..................***Failed    0.03 sec

42% tests passed, 15 tests failed out of 26 

> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major
>
> When specifying Win32 as a platform, and building with MSVC, the build fails 
> with the following compile errors :
> {noformat}
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error 
> C3861: '__popcnt64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error 
> C3861: '_BitScanReverse64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error 
> C3861: '_BitScanForward64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> {noformat}
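
All three missing identifiers are 64-bit-only MSVC intrinsics, which is why the 
failure is specific to the Win32 target. A minimal, MSVC-only sketch of the 
usual workaround - guard on the target architecture and compose the 32-bit 
intrinsics as a fallback (illustrative only, not Arrow's actual bit_util.h 
code):

{code:cpp}
#include <cstdint>
#include <intrin.h>

// Returns the number of leading zero bits in value (64 for value == 0).
static int CountLeadingZeros64(uint64_t value) {
  unsigned long index;
#if defined(_M_X64) || defined(_M_ARM64)
  // 64-bit targets have the 64-bit intrinsic.
  if (_BitScanReverse64(&index, value)) return 63 - static_cast<int>(index);
#else
  // 32-bit fallback: scan the high 32-bit word first, then the low word.
  if (_BitScanReverse(&index, static_cast<uint32_t>(value >> 32)))
    return 31 - static_cast<int>(index);
  if (_BitScanReverse(&index, static_cast<uint32_t>(value)))
    return 63 - static_cast<int>(index);
#endif
  return 64;
}
{code}

The same pattern (two 32-bit calls replacing one 64-bit call) would apply to 
{{__popcnt64}} and {{_BitScanForward64}}.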



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16779) Request for Pyarrow Flight to be shipped in arm64 MacOS version of the wheel

2022-06-10 Thread Ajay Kanagala (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552903#comment-17552903
 ] 

Ajay Kanagala commented on ARROW-16779:
---

Hi Team, we are currently blocked on this feature request.

Can you please help with the timeline for this feature request, and who should 
we reach out to in order to get it prioritized?

> Request for Pyarrow Flight to be shipped in arm64 MacOS version of the wheel
> 
>
> Key: ARROW-16779
> URL: https://issues.apache.org/jira/browse/ARROW-16779
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: FlightRPC
> Environment: Mac M1 OS, Python,
>Reporter: Ajay Kanagala
>Priority: Major
>  Labels: features
>
> This ticket is a continuation of the previous ticket 
> https://issues.apache.org/jira/browse/ARROW-13657
> It was found that Flight is not shipped in all versions of the wheel. You will 
> also get an import error if you attempt to import pyarrow.gandiva, which is 
> also an optional feature. It is turned off for arm64 MacOS here:
> [https://github.com/apache/arrow/blob/8f0ddc785dd72e950b570f3bc380deb15c124c45/dev/tasks/python-wheels/github.osx.arm64.yml#L26]
>  
> Our team uses the Mac M1 processor to work on the Dremio driver and needs 
> access to the pyarrow package.
>  
> Can you please add it to the wheel?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-13657) [Python] No module named 'pyarrow._flight' (MacOS)

2022-06-10 Thread Ajay Kanagala (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552902#comment-17552902
 ] 

Ajay Kanagala commented on ARROW-13657:
---

Hi [~willjones127], thank you very much for your response. It is really 
helpful. I created a new ticket (ARROW-16779) regarding shipping PyArrow Flight 
in the arm64 MacOS version of the wheel.

What is the timeline for the new feature request, and who do we reach out to 
to get it prioritized?

> [Python] No module named 'pyarrow._flight' (MacOS)
> --
>
> Key: ARROW-13657
> URL: https://issues.apache.org/jira/browse/ARROW-13657
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 5.0.0
> Environment: Device: Macbook Air M1 2020 (MacOS Big Sur 11.5.1)
> Python version: Python 3.9.4
> Arrow version: 5.0.0
>Reporter: Dinh Long Nguyen
>Priority: Major
>
> ModuleNotFoundError: No module named 'pyarrow._flight'
> *Error Detail:*
> Traceback (most recent call last):
>  File "arrowtest1/backend/server.py", line 4, in <module>
>  import pyarrow.flight as fl
>  File 
> ".local/share/virtualenvs/backend-OiVOEXti/lib/python3.9/site-packages/pyarrow/flight.py",
>  line 18, in <module>
>  from pyarrow._flight import ( # noqa:F401
>  ModuleNotFoundError: No module named 'pyarrow._flight'
> *Device*: Macbook Air M1 2020 (MacOS Big Sur 11.5.1)
> *Python version:* Python 3.9.4
> *Arrow version:* 5.0.0
> *Description*
> PyArrow works fine and can import other components such as pyarrow.orc, but 
> not pyarrow.flight.
> Tried out several machines (Intel 4790K, Ubuntu); they can import 
> pyarrow.flight no problem.
> Even tried out a VSCode Dev Container on the same MacBook, which can also 
> import flight no problem.
> But I can't import it when using pipenv directly within macOS (I mean, no 
> container).
> *Replication process:*
> pipenv install pyarrow
> python
> >>> import pyarrow.flight
> Then the error occurred.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-10 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552899#comment-17552899
 ] 

Will Jones commented on ARROW-16778:


{quote}This builds with three path parameters removed to use defaults and with 
my patch applied.
{quote}
Good!
{quote}But then ctest crashes running arrow-io-file-test.
{quote}
Yup, same here. Would you like to continue debugging yourself? Or else I can 
look into it soon.

 

> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major
>
> When specifying Win32 as a platform, and building with MSVC, the build fails 
> with the following compile errors :
> {noformat}
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error 
> C3861: '__popcnt64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error 
> C3861: '_BitScanReverse64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error 
> C3861: '_BitScanForward64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-10 Thread Arkadiy Vertleyb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552895#comment-17552895
 ] 

Arkadiy Vertleyb edited comment on ARROW-16778 at 6/10/22 6:02 PM:
---

[~willjones127] This builds with three path parameters removed to use defaults 
and with my patch applied.

But then ctest crashes running arrow-io-file-test. 


was (Author: JIRAUSER290619):
[~willjones127] This builds with three path parameters removed to use defaults 
and with my patch applied.

But ctest crashes running arrow-io-file-test. 

> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major
>
> When specifying Win32 as a platform, and building with MSVC, the build fails 
> with the following compile errors :
> {noformat}
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error 
> C3861: '__popcnt64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error 
> C3861: '_BitScanReverse64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error 
> C3861: '_BitScanForward64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16749) [Go] Bug when converting from Arrow to Parquet from null array

2022-06-10 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-16749.
---
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13310
[https://github.com/apache/arrow/pull/13310]

> [Go] Bug when converting from Arrow to Parquet from null array
> --
>
> Key: ARROW-16749
> URL: https://issues.apache.org/jira/browse/ARROW-16749
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Affects Versions: 9.0.0
>Reporter: Alexandre Crayssac
>Assignee: Alexandre Crayssac
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Hello world,
> When converting from Arrow to Parquet it looks like there is a bug with 
> arrays of type {{{}arrow.NULL{}}}. Here is a snippet of code to reproduce the 
> bug:
>  
> {code:java}
> package main
> import (
>   "fmt"
>   "log"
>   "os"
>   "github.com/apache/arrow/go/v9/arrow"
>   "github.com/apache/arrow/go/v9/arrow/array"
>   "github.com/apache/arrow/go/v9/arrow/memory"
>   "github.com/apache/arrow/go/v9/parquet/pqarrow"
> )
> const n = 10
> func run() error {
>   schema := arrow.NewSchema(
>   []arrow.Field{
>   {Name: "f1", Type: arrow.Null, Nullable: true},
>   },
>   nil,
>   )
>   rb := array.NewRecordBuilder(memory.DefaultAllocator, schema)
>   defer rb.Release()
>   for i := 0; i < n; i++ {
>   rb.Field(0).(*array.NullBuilder).AppendNull()
>   }
>   rec := rb.NewRecord()
>   defer rec.Release()
>   for i, col := range rec.Columns() {
>   fmt.Printf("column[%d] %q: %v\n", i, rec.ColumnName(i), col)
>   }
>   f, err := os.Create("output.parquet")
>   if err != nil {
>   return err
>   }
>   defer f.Close()
>   w, err := pqarrow.NewFileWriter(rec.Schema(), f, nil, 
> pqarrow.DefaultWriterProps())
>   if err != nil {
>   return err
>   }
>   defer w.Close()
>   err = w.Write(rec)
>   if err != nil {
>   return err
>   }
>   return nil
> }
> func main() {
>   if err := run(); err != nil {
>   log.Fatal(err)
>   }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-10 Thread Arkadiy Vertleyb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552895#comment-17552895
 ] 

Arkadiy Vertleyb commented on ARROW-16778:
--

[~willjones127] This builds with three path parameters removed to use defaults 
and with my patch applied.

But ctest crashes running arrow-io-file-test. 

> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major
>
> When specifying Win32 as a platform, and building with MSVC, the build fails 
> with the following compile errors :
> {noformat}
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error 
> C3861: '__popcnt64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error 
> C3861: '_BitScanReverse64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error 
> C3861: '_BitScanForward64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16809) [C++][Benchmarks] Create Filter Benchmark for Acero

2022-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16809:
---
Labels: pull-request-available  (was: )

> [C++][Benchmarks] Create Filter Benchmark for Acero
> ---
>
> Key: ARROW-16809
> URL: https://issues.apache.org/jira/browse/ARROW-16809
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Ivan Chau
>Assignee: Ivan Chau
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16809) [C++][Benchmarks] Create Filter Benchmark for Acero

2022-06-10 Thread Ivan Chau (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Chau updated ARROW-16809:
--
Summary: [C++][Benchmarks] Create Filter Benchmark for Acero  (was: 
[Benchmarks] Create Filter Benchmark for Acero)

> [C++][Benchmarks] Create Filter Benchmark for Acero
> ---
>
> Key: ARROW-16809
> URL: https://issues.apache.org/jira/browse/ARROW-16809
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Ivan Chau
>Assignee: Ivan Chau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16809) [Benchmarks] Create Filter Benchmark for Acero

2022-06-10 Thread Ivan Chau (Jira)
Ivan Chau created ARROW-16809:
-

 Summary: [Benchmarks] Create Filter Benchmark for Acero
 Key: ARROW-16809
 URL: https://issues.apache.org/jira/browse/ARROW-16809
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Benchmarking
Reporter: Ivan Chau
Assignee: Ivan Chau






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16716) [Benchmarks] Create Projection benchmark for Acero

2022-06-10 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-16716.
-
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13314
[https://github.com/apache/arrow/pull/13314]

> [Benchmarks] Create Projection benchmark for Acero
> --
>
> Key: ARROW-16716
> URL: https://issues.apache.org/jira/browse/ARROW-16716
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Li Jin
>Assignee: Ivan Chau
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
> Attachments: out, out_expression
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups

2022-06-10 Thread Edward Visel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Visel updated ARROW-16807:
-
Description: 
When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, 
results are correct.

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem in this StackOverflow question: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
 which is working from orc files.
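
One plausible reading of the symptom (5 distinct values reported as anywhere 
from 15 to 19, varying with scan order) is that each row group contributes its 
own distinct count and the counts are combined, instead of the per-group 
distinct *sets* being merged before counting. A self-contained sketch of the 
two aggregation patterns - illustrative only, not Arrow's kernel code:

{code:cpp}
#include <iostream>
#include <set>
#include <string>
#include <vector>

int main() {
  // Two "row groups" with overlapping values, like the starwars sex column.
  const std::vector<std::vector<std::string>> row_groups = {
      {"male", "none", "female"}, {"male", "female", "hermaphroditic"}};

  // Buggy pattern: count distinct values per row group, then sum the counts.
  std::size_t summed = 0;
  for (const auto& g : row_groups) {
    summed += std::set<std::string>(g.begin(), g.end()).size();
  }

  // Correct pattern: merge the distinct sets first, then count once.
  std::set<std::string> merged;
  for (const auto& g : row_groups) {
    merged.insert(g.begin(), g.end());
  }

  std::cout << "summed per-group counts:  " << summed << "\n";         // 6
  std::cout << "distinct over all groups: " << merged.size() << "\n";  // 4
  return 0;
}
{code}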

  was:
When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, 
results are correct.

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem in this StackOverflow question: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
 which is working from orc files.


> [C++] count_distinct aggregates incorrectly across row groups
> -
>
> Key: ARROW-16807
> URL: https://issues.apache.org/jira/browse/ARROW-16807
> Project: Apache Arrow
>  Issue Type: Bug
> 

[jira] [Created] (ARROW-16808) [C++] count_distinct aggregates incorrectly across row groups

2022-06-10 Thread Edward Visel (Jira)
Edward Visel created ARROW-16808:


 Summary: [C++] count_distinct aggregates incorrectly across row 
groups
 Key: ARROW-16808
 URL: https://issues.apache.org/jira/browse/ARROW-16808
 Project: Apache Arrow
  Issue Type: Bug
 Environment: > arrow::arrow_info()
Arrow package version: 8.0.0.9000

Capabilities:
   
dataset      TRUE
substrait   FALSE
parquet      TRUE
json         TRUE
s3           TRUE
utf8proc     TRUE
re2          TRUE
snappy       TRUE
gzip         TRUE
brotli       TRUE
zstd         TRUE
lz4          TRUE
lz4_frame    TRUE
lzo         FALSE
bz2          TRUE
jemalloc     TRUE
mimalloc    FALSE

Memory:
   
Allocator  jemalloc
Current    37.25 Kb
Max   925.42 Kb

Runtime:

SIMD Level  none
Detected SIMD Level none

Build:
 
C++ Library Version    9.0.0-SNAPSHOT
C++ Compiler   AppleClang
C++ Compiler Version  13.1.6.13160021
Git ID   d9d78946607f36e25e9d812a5cc956bd00ab2bc9
Reporter: Edward Visel
 Fix For: 9.0.0, 8.0.1


When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, 
results are correct.

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem in this StackOverflow question: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
 which is working from orc files.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (ARROW-16808) [C++] count_distinct aggregates incorrectly across row groups

2022-06-10 Thread Edward Visel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Visel closed ARROW-16808.

Resolution: Duplicate

Duplicate of [ARROW-16807]

> [C++] count_distinct aggregates incorrectly across row groups
> -
>
> Key: ARROW-16808
> URL: https://issues.apache.org/jira/browse/ARROW-16808
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: > arrow::arrow_info()
> Arrow package version: 8.0.0.9000
> Capabilities:
>
> dataset      TRUE
> substrait   FALSE
> parquet      TRUE
> json         TRUE
> s3           TRUE
> utf8proc     TRUE
> re2          TRUE
> snappy       TRUE
> gzip         TRUE
> brotli       TRUE
> zstd         TRUE
> lz4          TRUE
> lz4_frame    TRUE
> lzo         FALSE
> bz2          TRUE
> jemalloc     TRUE
> mimalloc    FALSE
> Memory:
>
> Allocator  jemalloc
> Current    37.25 Kb
> Max   925.42 Kb
> Runtime:
> 
> SIMD Level  none
> Detected SIMD Level none
> Build:
>  
> C++ Library Version    9.0.0-SNAPSHOT
> C++ Compiler   AppleClang
> C++ Compiler Version  13.1.6.13160021
> Git ID   d9d78946607f36e25e9d812a5cc956bd00ab2bc9
>Reporter: Edward Visel
>Priority: Blocker
> Fix For: 9.0.0, 8.0.1
>
>
> When reading from parquet files with multiple row groups, {{count_distinct}} 
> (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:
> {code:r}
> library(dplyr, warn.conflicts = FALSE)
> path <- tempfile(fileext = '.parquet')
> arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)
> ds <- arrow::open_dataset(path)
> ds %>% count(sex) %>% collect()
> #> # A tibble: 5 × 2
> #>   sex                n
> #>   <chr>          <int>
> #> 1 male              60
> #> 2 none               6
> #> 3 female            16
> #> 4 hermaphroditic     1
> #> 5 <NA>               4
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    19
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    16
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    16
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> # correct
> ds %>% collect() %>% summarise(n = n_distinct(sex))
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1     5
> {code}
> If the file is stored as a single row group, results are correct. When 
> grouped, results are correct.
> I can reproduce this in Python as well using the same file and 
> {{pyarrow.compute.count_distinct}}:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> pa.__version__
> #> 8.0.0
> starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')
> print(pa.compute.count_distinct(starwars.column('sex')).as_py())
> #> 15
> print(pa.compute.unique(starwars.column('sex')))
> #> [
> #>   "male",
> #>   "none",
> #>   "female",
> #>   "hermaphroditic",
> #>   null
> #> ]
> {code}
> This seems likely to be the same problem in this StackOverflow question: 
> https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
>  which is working from orc files.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups

2022-06-10 Thread Edward Visel (Jira)
Edward Visel created ARROW-16807:


 Summary: [C++] count_distinct aggregates incorrectly across row 
groups
 Key: ARROW-16807
 URL: https://issues.apache.org/jira/browse/ARROW-16807
 Project: Apache Arrow
  Issue Type: Bug
 Environment: > arrow::arrow_info()
Arrow package version: 8.0.0.9000

Capabilities:
   
dataset      TRUE
substrait   FALSE
parquet      TRUE
json         TRUE
s3           TRUE
utf8proc     TRUE
re2          TRUE
snappy       TRUE
gzip         TRUE
brotli       TRUE
zstd         TRUE
lz4          TRUE
lz4_frame    TRUE
lzo         FALSE
bz2          TRUE
jemalloc     TRUE
mimalloc    FALSE

Memory:
   
Allocator  jemalloc
Current    37.25 Kb
Max   925.42 Kb

Runtime:

SIMD Level  none
Detected SIMD Level none

Build:
 
C++ Library Version    9.0.0-SNAPSHOT
C++ Compiler   AppleClang
C++ Compiler Version  13.1.6.13160021
Git ID   d9d78946607f36e25e9d812a5cc956bd00ab2bc9
Reporter: Edward Visel
 Fix For: 9.0.0, 8.0.1


When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, 
results are correct.

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem in this StackOverflow question: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
 which is working from orc files.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16756) [C++] Introduce initial ArraySpan, ExecSpan non-owning / shared_ptr-free data structures for kernel execution, refactor scalar kernels

2022-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16756:
---
Labels: pull-request-available  (was: )

> [C++] Introduce initial ArraySpan, ExecSpan non-owning / shared_ptr-free data 
> structures for kernel execution, refactor scalar kernels
> --
>
> Key: ARROW-16756
> URL: https://issues.apache.org/jira/browse/ARROW-16756
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is essential to reduce microperformance overhead as has been discussed 
> and investigated many other places. This first stage of work is to remove the 
> use of {{Datum}} and {{ExecBatch}} from the input side of only scalar 
> kernels, so that we can work toward using span/view data structures as the 
> inputs (and eventually outputs) of all kernels. 
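
To make the motivation concrete, here is a toy sketch (not Arrow's actual 
definitions) of the difference between an owning array, whose buffers carry 
{{shared_ptr}} reference counts, and a non-owning span that merely borrows raw 
pointers and is therefore cheap to construct per batch inside a kernel loop:

{code:cpp}
#include <cstdint>
#include <memory>
#include <vector>

// Owning representation: copying it per kernel invocation costs atomic
// reference-count traffic on every buffer.
struct OwningArray {
  std::vector<std::shared_ptr<std::vector<uint8_t>>> buffers;
  int64_t length = 0;
};

// Non-owning "span": plain pointers, trivially copyable; the caller must
// keep the owning array alive while the span is in use.
struct ArraySpanSketch {
  const uint8_t* buffers[3] = {nullptr, nullptr, nullptr};
  int64_t length = 0;
};

ArraySpanSketch View(const OwningArray& array) {
  ArraySpanSketch span;
  span.length = array.length;
  for (std::size_t i = 0; i < array.buffers.size() && i < 3; ++i) {
    if (array.buffers[i]) span.buffers[i] = array.buffers[i]->data();
  }
  return span;
}
{code}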



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-10 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552863#comment-17552863
 ] 

Will Jones commented on ARROW-16778:


Wow that will save me a lot of time! :)

> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major
>
> When specifying Win32 as a platform, and building with MSVC, the build fails 
> with the following compile errors :
> {noformat}
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error 
> C3861: '__popcnt64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error 
> C3861: '_BitScanReverse64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error 
> C3861: '_BitScanForward64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-10 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552859#comment-17552859
 ] 

Antoine Pitrou commented on ARROW-16778:


Instead of disabling warnings one-by-one, you can change the 
{{BUILD_WARNING_LEVEL}} CMake variable. I don't know why we are not documenting 
it ([~kou] do you know?), but it is described thusly in 
{{cmake_modules/SetupCxxFlags.cmake}}:

{code}
# BUILD_WARNING_LEVEL add warning/error compiler flags. The possible values are
# - PRODUCTION: Build with `-Wall` but do not add `-Werror`, so warnings do not
#   halt the build.
# - CHECKIN: Build with `-Wall` and `-Wextra`.  Also, add `-Werror` in debug mode
#   so that any important warnings fail the build.
#   so that any important warnings fail the build.
# - EVERYTHING: Like `CHECKIN`, but possible extra flags depending on the
#   compiler, including `-Wextra`, `-Weverything`, `-pedantic`.
#   This is the most aggressive warning level.

# Defaults BUILD_WARNING_LEVEL to `CHECKIN`, unless CMAKE_BUILD_TYPE is
# `RELEASE`, then it will default to `PRODUCTION`. The goal of defaulting to
# `CHECKIN` is to avoid friction with long response time from CI.
{code}
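
So, for a local debug build where warnings-as-errors get in the way, the set of 
per-warning {{/wdNNNN}} flags could plausibly be replaced by a single cache 
entry (a usage sketch based on the description above; untested on this 
configuration):

{code:none}
cmake -DBUILD_WARNING_LEVEL=PRODUCTION <other options> ..
{code}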


> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major
>
> When specifying Win32 as a platform, and building with MSVC, the build fails 
> with the following compile errors :
> {noformat}
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error 
> C3861: '__popcnt64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error 
> C3861: '_BitScanReverse64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error 
> C3861: '_BitScanForward64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16746) [C++][Python] S3 tag support on write

2022-06-10 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552857#comment-17552857
 ] 

Antoine Pitrou commented on ARROW-16746:


[~ste...@apache.org] Thanks for the information. What are "user attributes" in 
this context? Are you talking about "User-defined object metadata" as defined 
in [https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html] ?

> [C++][Python] S3 tag support on write
> -
>
> Key: ARROW-16746
> URL: https://issues.apache.org/jira/browse/ARROW-16746
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: André Kelpe
>Priority: Major
>
> S3 allows tagging objects to better organize one's data 
> ([https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html]).
> We use this for efficient downstream processes/inventory management.
> Currently arrow/pyarrow does not allow tags to be added on write. This is 
> causing us to scan the bucket and re-apply the tags after a pyarrow-based 
> process has run.
> I looked through the code and think that it could potentially be done via the 
> metadata mechanism.
> The tags need to be added to the CreateMultipartUploadRequest here: 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L1156
> See also
> http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_s3_1_1_model_1_1_create_multipart_upload_request.html#af791f34a65dc69bd681d6995313be2da
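
For reference, a rough sketch of what forwarding tags at that call site could 
look like with the AWS SDK request type linked above. The tag string is 
hypothetical, and the exact setter should be checked against the SDK docs:

{code:cpp}
#include <aws/s3/model/CreateMultipartUploadRequest.h>

// Illustrative only: S3 carries object tags as a URL-encoded
// "key1=value1&key2=value2" string on the upload request.
void SetUploadTags(Aws::S3::Model::CreateMultipartUploadRequest* request) {
  request->SetTagging("team=data&retention=30d");  // hypothetical tag set
}
{code}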



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build

2022-06-10 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552853#comment-17552853
 ] 

Will Jones commented on ARROW-16778:


[~avertleyb] I've been able to get both of the following configurations to 
compile. Do you want to try them out and lmk if they work for you?
{code:none}
@REM Build using Ninja
cmake.EXE ^
 -DARROW_DEPENDENCY_SOURCE=BUNDLED ^
 -DCMAKE_BUILD_TYPE=Debug ^
 -DARROW_BUILD_TESTS=ON ^
 "-DARROW_CXXFLAGS=/wd4244 /wd4554 /wd4018" ^
 -DARROW_BUILD_INTEGRATION=OFF ^
 -DARROW_EXTRA_ERROR_CONTEXT=ON ^
 -DARROW_BUILD_STATIC=OFF ^
 -DARROW_WITH_RE2=OFF ^
 -DARROW_WITH_UTF8PROC=OFF ^
 
-DCMAKE_INSTALL_PREFIX=c:/Users/voltron/arrow/cpp/build/user-cpp-debug-win32-alt/dist
 ^
 -S%USERPROFILE%/arrow/cpp ^
 -B%USERPROFILE%/arrow/cpp/build/user-cpp-debug-win32 ^
 -G Ninja

@REM Or build using Visual Studio
cmake.EXE ^
 -DARROW_DEPENDENCY_SOURCE=BUNDLED ^
 -DCMAKE_BUILD_TYPE=Debug ^
 -DARROW_BUILD_TESTS=ON ^
 "-DARROW_CXXFLAGS=/wd4244 /wd4554 /wd4018" ^
 -DARROW_BUILD_INTEGRATION=OFF ^
 -DARROW_EXTRA_ERROR_CONTEXT=ON ^
 -DARROW_BUILD_STATIC=OFF ^
 -DARROW_WITH_RE2=OFF ^
 -DARROW_WITH_UTF8PROC=OFF ^
 
-DCMAKE_INSTALL_PREFIX=c:/Users/voltron/arrow/cpp/build/user-cpp-debug-win32-alt/dist
 ^
 -S%USERPROFILE%/arrow/cpp ^
 -B%USERPROFILE%/arrow/cpp/build/user-cpp-debug-win32-alt ^
 -G "Visual Studio 16 2019" ^
 -A Win32
 {code}

> [C++] 32 bit MSVC doesn't build
> ---
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Win32, MSVC
>Reporter: Arkadiy Vertleyb
>Priority: Major
>
> When specifying Win32 as a platform, and building with MSVC, the build fails 
> with the following compile errors :
> {noformat}
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error 
> C3861: '__popcnt64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error 
> C3861: '_BitScanReverse64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error 
> C3861: '_BitScanForward64': identifier not found 
> [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16716) [Benchmarks] Create Projection benchmark for Acero

2022-06-10 Thread Ivan Chau (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Chau reassigned ARROW-16716:
-

Assignee: Ivan Chau

> [Benchmarks] Create Projection benchmark for Acero
> --
>
> Key: ARROW-16716
> URL: https://issues.apache.org/jira/browse/ARROW-16716
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Li Jin
>Assignee: Ivan Chau
>Priority: Major
>  Labels: pull-request-available
> Attachments: out, out_expression
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-06-10 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542244#comment-17542244
 ] 

Antoine Pitrou edited comment on ARROW-15678 at 6/10/22 3:04 PM:
-

On further investigation, we can include immintrin.h with or without -mavx2 and 
clang at least will not complain unless the intrinsics are referenced, so

{code}
#include <immintrin.h>

[[gnu::target("avx2")]]
void use_simd() {
  __m256i arg;
  _mm256_abs_epi16 (arg);
}

int main() { use_simd(); }
{code}

compiles and runs happily without any special compilation flags. Using an 
attribute like this seems viable provided we can be certain that the modified 
target isn't transitively applied to functions which might be invoked for the 
first time inside a SIMD enabled function


was (Author: bkietz):
On further investigation, we can include immintrin.h with or without -mavx2 and 
clang at least will not complain unless the intrinsics are referenced, so

{{code}}
#include <immintrin.h>

[[gnu::target("avx2")]]
void use_simd() {
  __m256i arg;
  _mm256_abs_epi16 (arg);
}

int main() { use_simd(); }
{{code}}

compiles and runs happily without any special compilation flags. Using an 
attribute like this seems viable provided we can be certain that the modified 
target isn't transitively applied to functions which might be invoked for the 
first time inside a SIMD enabled function

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16759) [Go]

2022-06-10 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-16759.
---
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13322
[https://github.com/apache/arrow/pull/13322]

> [Go]
> 
>
> Key: ARROW-16759
> URL: https://issues.apache.org/jira/browse/ARROW-16759
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Go
>Affects Versions: 7.0.0, 8.0.0
>Reporter: Dominic Barnes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The packages under github.com/apache/arrow/go currently have a dependency on 
> github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 
> that has an outstanding security vulnerability 
> ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]).
> While testify is only used during tests, this is not distinguished by the Go 
> toolchain and other tools like Snyk which scan the dependency chain for 
> vulnerabilities. Unfortunately, due to Go's [minimal version 
> selection|https://go.dev/ref/mod#minimal-version-selection], this ends up 
> requiring us to revisit our dependencies to ensure this security vulnerability 
> is addressed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16746) [C++][Python] S3 tag support on write

2022-06-10 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552769#comment-17552769
 ] 

Steve Loughran commented on ARROW-16746:


hadoop s3a maps user attributes to the filesystem XAttr APIs, very soon to let 
you also set them when you create a file.

> [C++][Python] S3 tag support on write
> -
>
> Key: ARROW-16746
> URL: https://issues.apache.org/jira/browse/ARROW-16746
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: André Kelpe
>Priority: Major
>
> S3 allows tagging objects to better organize one's data 
> ([https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html]).
> We use this for efficient downstream processes/inventory management.
> Currently arrow/pyarrow does not allow tags to be added on write. This is 
> causing us to scan the bucket and re-apply the tags after a pyarrow-based 
> process has run.
> I looked through the code and think that it could potentially be done via the 
> metadata mechanism.
> The tags need to be added to the CreateMultipartUploadRequest here: 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L1156
> See also
> http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_s3_1_1_model_1_1_create_multipart_upload_request.html#af791f34a65dc69bd681d6995313be2da



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-5356) [JS] Implement Duration type, integration test support for Interval and Duration types

2022-06-10 Thread Lukas Masuch (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552763#comment-17552763
 ] 

Lukas Masuch commented on ARROW-5356:
-

I'm also running into the same problem ('Unrecognized type: "Duration" (18)'). 
Looking into the code, it seems that the Duration type is only partially 
implemented in javascript/typescript. For example, it fails to decode the 
duration type because it is missing in this switch case:  
[https://github.com/apache/arrow/blob/apache-arrow-9.0.0.dev/js/src/ipc/metadata/message.ts#L440]

Not sure if this is intentional, a bug, or just not implemented yet?

> [JS] Implement Duration type, integration test support for Interval and 
> Duration types
> --
>
> Key: ARROW-5356
> URL: https://issues.apache.org/jira/browse/ARROW-5356
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Assignee: Brian Hulette
>Priority: Major
>
> Follow on work to ARROW-835



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16801) [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job

2022-06-10 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552760#comment-17552760
 ] 

Raúl Cumplido commented on ARROW-16801:
---

I have tried pinning an old version of minio, as seen in this testing PR 
([https://github.com/apache/arrow/pull/13362]), but it is not working.

It seems the only available version at brew is the latest one:
{code:java}
% brew info minio
minio: stable 20220508235031 (bottled), HEAD
High Performance, Kubernetes Native Object Storage
https://min.io
/opt/homebrew/Cellar/minio/20210722052332 (7 files, 91.8MB) *
  Poured from bottle on 2021-07-26 at 11:06:45
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/minio.rb
License: AGPL-3.0-or-later
==> Dependencies
Build: go ✘
==> Options
--HEAD
Install HEAD version
==> Caveats
To restart minio after an upgrade:
  brew services restart minio
Or, if you don't want/need a background service you can just run:
  /opt/homebrew/opt/minio/bin/minio server --config-dir=/opt/homebrew/etc/minio 
--address=:9000 /opt/homebrew/var/minio
==> Analytics
install: 1,535 (30 days), 5,097 (90 days), 19,136 (365 days)
install-on-request: 1,534 (30 days), 5,097 (90 days), 19,109 (365 days)
build-error: 0 (30 days)
% brew search minio
==> Formulae
minio ✔  minio-mc minicom  minipro  
minica   minbif   mint

==> Casks
min   
miniwol {code}
[~kou] do you know if there is a way of accessing a previous formula on brew?
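
One possible route (a sketch; the version string comes from the Cellar path 
above and would need to exist in homebrew-core history): {{brew extract}} can 
copy an old formula into a personal tap, which can then be installed by 
versioned name:

{code:none}
brew tap-new $USER/local-minio
brew extract --version=20210722052332 minio $USER/local-minio
brew install $USER/local-minio/minio@20210722052332
{code}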

> [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job
> ---
>
> Key: ARROW-16801
> URL: https://issues.apache.org/jira/browse/ARROW-16801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Anthony Louis Gotlib Ferreira
>Priority: Major
>
> Since some days ago, all C++ PR's are failing in _arrow-s3fs-test_ for the 
> *AMD64 MacOS 10.15 C++* job.
> The error message is big, but one example is below:
> {code:java}
> [   OK ] TestS3FS.GetFileInfoGenerator (55 ms)
> 234[ RUN  ] TestS3FS.CreateDir
> 235/Users/runner/work/arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:798: 
> Failure
> 236Failed
> 237Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got 
> OK
> 238* Closing connection 0
> 239* Closing connection 0
> 240[  FAILED  ] TestS3FS.CreateDir (113 ms)
> 241[ RUN  ] TestS3FS.DeleteFile
> 242* Closing connection 0 {code}
> Here is a PR where that test failed: 
> [https://github.com/apache/arrow/runs/6816324194?check_suite_focus=true]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16801) [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job

2022-06-10 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552739#comment-17552739
 ] 

Raúl Cumplido commented on ARROW-16801:
---

I missed that we were installing minio from brew on some of the MAC jobs: 
[https://github.com/apache/arrow/blob/master/cpp/Brewfile#L32]

We probably should install it manually from a specific version as we do on the 
other jobs.

> [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job
> ---
>
> Key: ARROW-16801
> URL: https://issues.apache.org/jira/browse/ARROW-16801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Anthony Louis Gotlib Ferreira
>Priority: Major
>
> Since some days ago, all C++ PR's are failing in _arrow-s3fs-test_ for the 
> *AMD64 MacOS 10.15 C++* job.
> The error message is big, but one example is below:
> {code:java}
> [   OK ] TestS3FS.GetFileInfoGenerator (55 ms)
> 234[ RUN  ] TestS3FS.CreateDir
> 235/Users/runner/work/arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:798: 
> Failure
> 236Failed
> 237Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got 
> OK
> 238* Closing connection 0
> 239* Closing connection 0
> 240[  FAILED  ] TestS3FS.CreateDir (113 ms)
> 241[ RUN  ] TestS3FS.DeleteFile
> 242* Closing connection 0 {code}
> Here is a PR where that test failed: 
> [https://github.com/apache/arrow/runs/6816324194?check_suite_focus=true]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16801) [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job

2022-06-10 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552729#comment-17552729
 ] 

Antoine Pitrou commented on ARROW-16801:


The CI failure is probably related to a new Minio version. We already pin Minio 
on other CI jobs to avoid such issues:
https://github.com/apache/arrow/blob/master/ci/scripts/install_minio.sh#L54

> [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job
> ---
>
> Key: ARROW-16801
> URL: https://issues.apache.org/jira/browse/ARROW-16801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Anthony Louis Gotlib Ferreira
>Priority: Major
>
> Since some days ago, all C++ PR's are failing in _arrow-s3fs-test_ for the 
> *AMD64 MacOS 10.15 C++* job.
> The error message is big, but one example is below:
> {code:java}
> [   OK ] TestS3FS.GetFileInfoGenerator (55 ms)
> 234[ RUN  ] TestS3FS.CreateDir
> 235/Users/runner/work/arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:798: 
> Failure
> 236Failed
> 237Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got 
> OK
> 238* Closing connection 0
> 239* Closing connection 0
> 240[  FAILED  ] TestS3FS.CreateDir (113 ms)
> 241[ RUN  ] TestS3FS.DeleteFile
> 242* Closing connection 0 {code}
> Here is a PR where that test failed: 
> [https://github.com/apache/arrow/runs/6816324194?check_suite_focus=true]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16801) [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job

2022-06-10 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552726#comment-17552726
 ] 

Antoine Pitrou commented on ARROW-16801:


[~anthonylouis] It is not blocking PRs, because maintainers can ignore 
unrelated CI failures.

> [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job
> ---
>
> Key: ARROW-16801
> URL: https://issues.apache.org/jira/browse/ARROW-16801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Anthony Louis Gotlib Ferreira
>Priority: Major
>
> Since some days ago, all C++ PR's are failing in _arrow-s3fs-test_ for the 
> *AMD64 MacOS 10.15 C++* job.
> The error message is big, but one example is below:
> {code:java}
> [   OK ] TestS3FS.GetFileInfoGenerator (55 ms)
> 234[ RUN  ] TestS3FS.CreateDir
> 235/Users/runner/work/arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:798: 
> Failure
> 236Failed
> 237Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got 
> OK
> 238* Closing connection 0
> 239* Closing connection 0
> 240[  FAILED  ] TestS3FS.CreateDir (113 ms)
> 241[ RUN  ] TestS3FS.DeleteFile
> 242* Closing connection 0 {code}
> Here is a PR where that test failed: 
> [https://github.com/apache/arrow/runs/6816324194?check_suite_focus=true]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16751) [C++] Unify target include directories

2022-06-10 Thread Jacob Wujciak-Jens (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552718#comment-17552718
 ] 

Jacob Wujciak-Jens commented on ARROW-16751:


+1 for bumping the minimum; since 3.5, a lot of [useful 
functionality|https://cliutils.gitlab.io/modern-cmake/chapters/intro/newcmake.html]
 has been introduced.

Side note: do we have CI jobs that use 3.5 to make sure that we are actually 
supporting it?

> [C++] Unify target include directories
> --
>
> Key: ARROW-16751
> URL: https://issues.apache.org/jira/browse/ARROW-16751
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Priority: Major
> Fix For: 9.0.0
>
>
> Context: [https://github.com/apache/arrow/pull/13244#discussion_r889780669]
> {{target_include_directories()}} in CMake 3.10 or earlier doesn't support 
> {{INTERFACE}} on an {{IMPORTED}} target, so we have to check the CMake 
> version as below:
> {code:java}
> if(CMAKE_VERSION VERSION_LESS 3.11)
>   set_target_properties(xsimd PROPERTIES INTERFACE_INCLUDE_DIRECTORIES
>  "${XSIMD_INCLUDE_DIR}")
> else()
>   target_include_directories(xsimd INTERFACE "${XSIMD_INCLUDE_DIR}")
> endif()
> {code}
> The above code is duplicated for several targets, and some targets (e.g. 
> ucx::ucx) are missing the check.
> We can add a function 
> {{arrow_imported_target_interface_include_directories()}} to make this 
> simpler; see the sketch below.
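A minimal sketch of what such a helper might look like, assuming it simply 
wraps the version check shown above (the function name comes from this issue; 
the exact signature is hypothetical):

{code:java}
# Hypothetical helper: applies INTERFACE include directories to an
# IMPORTED target, working around the pre-3.11 limitation in one place.
function(arrow_imported_target_interface_include_directories TARGET)
  if(CMAKE_VERSION VERSION_LESS 3.11)
    set_target_properties(${TARGET} PROPERTIES INTERFACE_INCLUDE_DIRECTORIES
                                               "${ARGN}")
  else()
    target_include_directories(${TARGET} INTERFACE ${ARGN})
  endif()
endfunction()

# Usage, replacing each duplicated if/else block:
# arrow_imported_target_interface_include_directories(xsimd "${XSIMD_INCLUDE_DIR}")
{code}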



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (ARROW-16805) [R] R crashing with Apple M1 chip

2022-06-10 Thread Gil Henriques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gil Henriques closed ARROW-16805.
-
Resolution: Fixed

The solution is to install R using the *Apple silicon arm64* build instead of 
the Intel build.

> [R] R crashing with Apple M1 chip
> -
>
> Key: ARROW-16805
> URL: https://issues.apache.org/jira/browse/ARROW-16805
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: MacBook Pro 13-inch, M1, 2020
> R version 4.1.1 (2021-08-10)
> Platform: x86_64-apple-darwin17.0 (64-bit)
> Running under: macOS Monterey 12.1
>Reporter: Gil Henriques
>Priority: Major
>
> When using the {arrow} package, R crashes as soon as a dplyr verb is used on 
> a parquet object. This does not happen on Windows computers, but I have 
> reproduced it on two separate MacBook Pros with an Apple M1 chip. The crash 
> happens both in RStudio and when running R from the command line.
> The reprex below is based on the vignette for {arrow}, available at 
> [https://arrow.apache.org/docs/r/]:
>  
> {{library(arrow)}}
> {{library(dplyr)}}
> {{write_parquet(starwars, sink = 'sw_parquet')}}
> {{sw <- read_parquet(file = 'sw_parquet', as_data_frame = FALSE)}}
> {{result <- sw %>%}}
> {{  filter(homeworld == "Tatooine")}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16801) [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job

2022-06-10 Thread Anthony Louis Gotlib Ferreira (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552679#comment-17552679
 ] 

Anthony Louis Gotlib Ferreira commented on ARROW-16801:
---

[~kou] [~apitrou] Do you have any idea whom I can ask for help to fix this 
problem? It is blocking some PRs from being merged, as the CI is not green.

> [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job
> ---
>
> Key: ARROW-16801
> URL: https://issues.apache.org/jira/browse/ARROW-16801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Anthony Louis Gotlib Ferreira
>Priority: Major
>
> For some days now, all C++ PRs have been failing in _arrow-s3fs-test_ for the 
> *AMD64 MacOS 10.15 C++* job.
> The error message is long, but one example is below:
> {code:java}
> [   OK ] TestS3FS.GetFileInfoGenerator (55 ms)
> [ RUN  ] TestS3FS.CreateDir
> /Users/runner/work/arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:798: Failure
> Failed
> Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got OK
> * Closing connection 0
> * Closing connection 0
> [  FAILED  ] TestS3FS.CreateDir (113 ms)
> [ RUN  ] TestS3FS.DeleteFile
> * Closing connection 0 {code}
> Here is a PR where that test failed: 
> [https://github.com/apache/arrow/runs/6816324194?check_suite_focus=true]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16806) [CI][Python] Verify source script fails on Ubuntu 18.04 due to old setuptools version

2022-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16806:
---
Labels: pull-request-available  (was: )

> [CI][Python] Verify source script fails on Ubuntu 18.04 due to old setuptools 
> version
> -
>
> Key: ARROW-16806
> URL: https://issues.apache.org/jira/browse/ARROW-16806
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the verify-rc-source-python-linux-ubuntu-18.04-amd64 job is failing 
> (https://github.com/ursacomputing/crossbow/runs/6814290999?check_suite_focus=true)
>  due to:
> {code:java}
>   File "setup.py", line 37, in 
>     from setuptools import setup, Extension, Distribution, 
> find_namespace_packages
> ImportError: cannot import name 'find_namespace_packages' from 'setuptools' 
> (/tmp/arrow-HEAD.kvwV0/venv-source/lib/python3.8/site-packages/setuptools/__init__.py)
>  {code}
> This change was introduced in PR 
> [https://github.com/apache/arrow/pull/13309], which fixed 
> https://issues.apache.org/jira/browse/ARROW-16726.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16806) [CI][Python] Verify source script fails on Ubuntu 18.04 due to old setuptools version

2022-06-10 Thread Jira
Raúl Cumplido created ARROW-16806:
-

 Summary: [CI][Python] Verify source script fails on Ubuntu 18.04 
due to old setuptools version
 Key: ARROW-16806
 URL: https://issues.apache.org/jira/browse/ARROW-16806
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Python
Reporter: Raúl Cumplido
Assignee: Raúl Cumplido
 Fix For: 9.0.0


Currently the verify-rc-source-python-linux-ubuntu-18.04-amd64 job is failing 
(https://github.com/ursacomputing/crossbow/runs/6814290999?check_suite_focus=true)
 due to:
{code:java}
  File "setup.py", line 37, in 
    from setuptools import setup, Extension, Distribution, 
find_namespace_packages
ImportError: cannot import name 'find_namespace_packages' from 'setuptools' 
(/tmp/arrow-HEAD.kvwV0/venv-source/lib/python3.8/site-packages/setuptools/__init__.py)
 {code}
This change was introduced in PR 
[https://github.com/apache/arrow/pull/13309], which fixed 
https://issues.apache.org/jira/browse/ARROW-16726.
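For context, {{find_namespace_packages}} was only added in setuptools 40.1.0 
(to the best of my knowledge), which is why the stock Ubuntu 18.04 tooling 
trips over the bare import. A guarded import sketches one possible workaround 
(hypothetical; not necessarily the fix taken in the PR):

{code:python}
# Hypothetical workaround, not the PR's actual fix: setuptools older
# than 40.1.0 has no find_namespace_packages, so guard the import.
try:
    from setuptools import find_namespace_packages
except ImportError:  # pre-40.1.0 setuptools, e.g. Ubuntu 18.04 stock venv
    find_namespace_packages = None  # or require: pip install -U setuptools
{code}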



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16799) [C++] Create a signal-safe self-pipe abstraction

2022-06-10 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai resolved ARROW-16799.
--
Resolution: Fixed

Issue resolved by pull request 13354
[https://github.com/apache/arrow/pull/13354]

> [C++] Create a signal-safe self-pipe abstraction
> 
>
> Key: ARROW-16799
> URL: https://issues.apache.org/jira/browse/ARROW-16799
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> A signal-safe self-pipe is already used in the Flight server to shut down the 
> server on an incoming SIGINT.
> We should refactor this to expose a reusable abstraction.
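For readers unfamiliar with the pattern, a generic sketch of the self-pipe 
trick is below (illustrative only; Arrow's actual abstraction is the one added 
in the PR above):

{code:cpp}
// Generic self-pipe pattern, not Arrow's actual API: the signal handler
// only performs a single async-signal-safe write(); the event loop owns
// the read end and reacts to the byte.
#include <csignal>
#include <unistd.h>

static int pipe_fds[2];  // [0] = read end, [1] = write end

extern "C" void OnSignal(int) {
  const char byte = 1;
  (void)write(pipe_fds[1], &byte, 1);  // write() is async-signal-safe
}

int main() {
  if (pipe(pipe_fds) != 0) return 1;
  std::signal(SIGINT, OnSignal);
  char byte;
  // A real server would poll() the read end alongside its other fds;
  // here we simply block until SIGINT arrives, then shut down.
  (void)read(pipe_fds[0], &byte, 1);
  // ... graceful shutdown would go here ...
  return 0;
}
{code}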



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16805) [R] R crashing with Apple M1 chip

2022-06-10 Thread Gil Henriques (Jira)
Gil Henriques created ARROW-16805:
-

 Summary: [R] R crashing with Apple M1 chip
 Key: ARROW-16805
 URL: https://issues.apache.org/jira/browse/ARROW-16805
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
 Environment: MacBook Pro 13-inch, M1, 2020
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.1
Reporter: Gil Henriques


When using the {arrow} package, R crashes as soon as a dplyr verb is used on a 
parquet object. This does not happen on Windows computers, but I have 
reproduced it on two separate MacBook Pros with an Apple M1 chip. The crash 
happens both in RStudio and when running R from the command line.

The reprex below is based on the vignette for {arrow}, available at 
[https://arrow.apache.org/docs/r/]:

{{library(arrow)}}

{{library(dplyr)}}

{{write_parquet(starwars, sink = 'sw_parquet')}}

{{sw <- read_parquet(file = 'sw_parquet', as_data_frame = FALSE)}}

{{result <- sw %>%}}
{{  filter(homeworld == "Tatooine")}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16543) [JS] Timestamp types are all the same

2022-06-10 Thread Teodor Kostov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552616#comment-17552616
 ] 

Teodor Kostov commented on ARROW-16543:
---

Here is a full test:

{code:javascript}
const dataType = new arrow.Struct([
  new arrow.Field('time', new arrow.TimestampSecond()),
  new arrow.Field('value', new arrow.Float64()),
])
const builder = arrow.makeBuilder({ type: dataType, nullValues: [null, undefined] })

const date = new Date()
const timestampSeconds = Math.floor(date.getTime() / 1000)
const timestamp = timestampSeconds * 1000
builder.append({ time: date, value: 1.2 })
builder.append({ time: date, value: 3.3 })
builder.finish()
const vector = builder.toVector()

const schema = new arrow.Schema(dataType.children)
const recordBatch = new arrow.RecordBatch(schema, vector.data[0])
const table = new arrow.Table(recordBatch)

console.log(timestamp)
console.log(timestampSeconds)
console.log(table.get(0).time)

console.log(table.get(0).time === timestamp) // should be false
console.log(table.get(0).time === timestampSeconds) // should be true
{code}
 

> [JS] Timestamp types are all the same
> -
>
> Key: ARROW-16543
> URL: https://issues.apache.org/jira/browse/ARROW-16543
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Teodor Kostov
>Priority: Major
>
> The current timestamp types are all the same: they have the same 
> representation and the same precision.
> For example, {{TimestampSecond}} and {{TimestampMillisecond}} both return 
> values like {{165211818}}. Instead, I would expect {{TimestampSecond}} 
> to drop the 3 zeros when returning a value, e.g. {{1652118180}}. Also, the 
> representation underneath is still an {{int32}} array. Even though for 
> {{TimestampSecond}} every second value is {{0}}, the array still holds double 
> the number of integers.
> I also got an error when trying to read a {{Date}} as {{TimestampNanosecond}}: 
> {{TypeError: can't convert 165211818 to BigInt}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16804) [CI][Conan] Merge upstream changes

2022-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16804:
---
Labels: pull-request-available  (was: )

> [CI][Conan] Merge upstream changes
> --
>
> Key: ARROW-16804
> URL: https://issues.apache.org/jira/browse/ARROW-16804
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16804) [CI][Conan] Merge upstream changes

2022-06-10 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16804:


 Summary: [CI][Conan] Merge upstream changes
 Key: ARROW-16804
 URL: https://issues.apache.org/jira/browse/ARROW-16804
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument

2022-06-10 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552569#comment-17552569
 ] 

Yaron Gvili commented on ARROW-16796:
-

If coding safety is a major concern here (IMHO it is), I'd suggest that in the 
longer term Arrow code should distinguish between simplification of expressions 
with and without functions/execution: only the former requires an ExecContext, 
whereas only the latter will fail if a function exists in the expression. 
Perhaps the simplest, though likely not ideal, code change for this is 
defaulting ExecContext to an implementation that fails.

The purpose of the PR is just a short-term fix. Follow-up issues can be 
created for what remains.
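A rough sketch of that last idea, with entirely hypothetical names rather than 
Arrow's real classes:

{code:cpp}
// Hypothetical sketch, not Arrow's real classes: the default context
// carries a registry whose lookups always fail, so any simplification
// path that needs function execution errors out loudly unless the
// caller explicitly supplied a working ExecContext.
#include <iostream>
#include <stdexcept>
#include <string>

struct FunctionRegistry {
  virtual ~FunctionRegistry() = default;
  virtual std::string GetFunction(const std::string& name) const = 0;
};

struct FailingRegistry : FunctionRegistry {
  std::string GetFunction(const std::string& name) const override {
    throw std::runtime_error("no ExecContext supplied; cannot look up '" +
                             name + "'");
  }
};

struct ExecContext {
  const FunctionRegistry* registry;
};

// The default context fails fast on any function lookup.
const ExecContext& DefaultExecContext() {
  static const FailingRegistry failing;
  static const ExecContext ctx{&failing};
  return ctx;
}

int main() {
  try {
    DefaultExecContext().registry->GetFunction("add");
  } catch (const std::exception& e) {
    std::cout << e.what() << "\n";  // fails fast instead of silently binding
  }
  return 0;
}
{code}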

> [C++] Fix bad defaulting of ExecContext argument
> 
>
> Key: ARROW-16796
> URL: https://issues.apache.org/jira/browse/ARROW-16796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In several places in Arrow code, invocations of Expression::Bind() default 
> the ExecContext argument. This leads to the default function registry being 
> used in expression manipulations, and this becomes a problem when the user 
> wishes to use a non-default function registry, e.g., when passing one to the 
> ExecContext of an ExecPlan, which is how I discovered this issue. The 
> problematic places I found with such Expression::Bind() invocations are:
>  * cpp/src/arrow/dataset/file_parquet.cc
>  * cpp/src/arrow/dataset/scanner.cc
>  * cpp/src/arrow/compute/exec/project_node.cc
>  * cpp/src/arrow/compute/exec/hash_join_node.cc
>  * cpp/src/arrow/compute/exec/filter_node.cc
> There are also other places in test and benchmark code (grep for 'Bind()').
> Another case of bad defaulting of an ExecContext argument is in 
> Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh 
> ExecContext is created, instead of being received from the caller, and passed 
> to BindNonRecursive.
> I'd argue that an ExecContext variable should not be allowed to default, 
> except perhaps in the highest-level/user-facing APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument

2022-06-10 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552568#comment-17552568
 ] 

Yaron Gvili commented on ARROW-16796:
-

Copying [Weston Pace's 
post|https://github.com/apache/arrow/pull/13355#issuecomment-1151679039]:

Good catch. I wonder if we should remove the default argument to bind entirely 
(it would look something like 
[westonpace@{{c9ae1dd}}|https://github.com/westonpace/arrow/commit/c9ae1dd6a0857af69e48a95ec76480f4c466791e]
 ). It looks like there are only a few other non-test spots where we call bind.
 * In the parquet reader we convert statistics into expressions and bind them 
to the schema. These expressions will only use min/max, and it's only really 
for simplification and not execution, so we're probably OK.
 * The scanner has a number of methods that create exec plans (this is the 
"lightweight producer" half of the scanner). We could arguably add an 
ExecContext to scan options, but I think it would be better to start phasing 
out this half of the scanner in favor of direct use of exec plans.
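A miniature of the hazard being removed, again with hypothetical signatures 
rather than Arrow's real ones:

{code:cpp}
#include <iostream>
#include <string>

struct ExecContext {
  std::string registry = "default registry";
};

// Before the change: a defaulted parameter makes it easy for call sites
// to silently bind against the default registry.
void Bind(const ExecContext& ctx = ExecContext{}) {
  std::cout << "binding with " << ctx.registry << "\n";
}

int main() {
  ExecContext custom{"custom registry"};
  Bind();        // oops: quietly uses the default registry
  Bind(custom);  // what a custom setup actually needs
  return 0;
}
{code}

Removing the default turns the first call into a compile error, forcing every 
call site to state which context (and thus which registry) it binds against.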

> [C++] Fix bad defaulting of ExecContext argument
> 
>
> Key: ARROW-16796
> URL: https://issues.apache.org/jira/browse/ARROW-16796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In several places in Arrow code, invocations of Expression::Bind() default 
> the ExecContext argument. This leads to the default function registry being 
> used in expression manipulations, and this becomes a problem when the user 
> wishes to use a non-default function registry, e.g., when passing one to the 
> ExecContext of an ExecPlan, which is how I discovered this issue. The 
> problematic places I found with such Expression::Bind() invocations are:
>  * cpp/src/arrow/dataset/file_parquet.cc
>  * cpp/src/arrow/dataset/scanner.cc
>  * cpp/src/arrow/compute/exec/project_node.cc
>  * cpp/src/arrow/compute/exec/hash_join_node.cc
>  * cpp/src/arrow/compute/exec/filter_node.cc
> There are also other places in test and benchmark code (grep for 'Bind()').
> Another case of bad defaulting of an ExecContext argument is in 
> Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh 
> ExecContext is created, instead of being received from the caller, and passed 
> to BindNonRecursive.
> I'd argue that an ExecContext variable should not be allowed to default, 
> except perhaps in the highest-level/user-facing APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)