[jira] [Commented] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source
[ https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553023#comment-17553023 ] Earle Lyons commented on ARROW-16810: - Thanks so much for the response, [~yibocai]! I sincerely appreciate the information. It makes sense and aligns with the behavior I observed. I suppose I could add code to remove any parquet files before calling pyarrow.dataset(path, format=custom_csv_format). However, an argument to read only specific file types would be very helpful. For example:
# pyarrow.dataset(path, format=custom_csv_format, filetype='csv')
# pyarrow.dataset(path, format=custom_csv_format, fileext='.csv')
# pyarrow.dataset(path/*.csv, format=custom_csv_format)
Per your comment, I am CCing [~jorisvandenbossche]. Thanks again! :)

> [Python] PyArrow: write_dataset - Could not open CSV input source
> -----------------------------------------------------------------
>
> Key: ARROW-16810
> URL: https://issues.apache.org/jira/browse/ARROW-16810
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 8.0.0
> Environment: Pop!_OS 20.04 LTS, Conda 4.11.0 / Mamba 0.23.0
> Reporter: Earle Lyons
> Priority: Minor
>
> Hi Arrow Community!
> Happy Friday! I am a new user to Arrow, specifically pyarrow, and I am very excited about the project.
> I am experiencing issues with the {*}write_dataset{*} function from the {*}dataset{*} module. Please forgive me if this is a known issue; I have searched the GitHub issues as well as Stack Overflow and have not found a similar one.
> I have a directory that contains 90 CSV files (essentially one CSV for each day between 2021-01-01 and 2021-03-31). My objective was to read all the CSV files into a dataset and write the dataset out as a single Parquet file.
> Unfortunately, some of the CSV files contained nulls in some columns, which caused issues that were resolved by specifying DataTypes per the following Stack Overflow solution: [How do I specify a dtype for all columns when reading a CSV file with pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow]
> The following code works on the first pass.
> {code:python}
> import pyarrow as pa
> import pyarrow.csv as csv
> import pyarrow.dataset as ds
> import re
> {code}
> {code:python}
> pa.__version__
> '8.0.0'
> {code}
> {code:python}
> column_types = {}
> csv_path = '/home/user/csv_files'
> field_re_pattern = "value_*"
>
> # Open a dataset with the 'csv_path' path and the 'csv' file format,
> # and assign it to 'dataset1'
> dataset1 = ds.dataset(csv_path, format='csv')
>
> # Loop through each field in the 'dataset1' schema whose name matches
> # the 'field_re_pattern' regex, and assign the 'int64' DataType to that
> # field name in the 'column_types' dictionary
> for field in (field for field in dataset1.schema \
>               if re.match(field_re_pattern, field.name)):
>     column_types[field.name] = pa.int64()
>
> # Create options for CSV data using the 'column_types' dictionary
> # This returns a ConvertOptions object
> convert_options = csv.ConvertOptions(column_types=column_types)
>
> # Create a FileFormat for CSV using the 'convert_options'
> # This returns a CsvFileFormat object
> custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)
>
> # Open a dataset with the 'csv_path' path, but instead of the plain
> # 'csv' file format use the 'custom_csv_format', and assign it to
> # 'dataset2'
> dataset2 = ds.dataset(csv_path, format=custom_csv_format)
>
> # Write 'dataset2' to the 'csv_path' base directory in the 'parquet'
> # format, overwriting/ignoring if the file exists
> ds.write_dataset(dataset2, base_dir=csv_path, format='parquet',
>                  existing_data_behavior='overwrite_or_ignore')
> {code}
> As previously stated, on the first pass the code works and creates a single parquet file (part-0.parquet) with the correct data, row count, and schema. However, if the code is run again, the following error is encountered:
> {code:python}
> ArrowInvalid: Could not open CSV input source '/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2: Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V p A$A18CEBS 305DEM030TTW �5HZ50GCVJV1CSV
> {code}
> My interpretation of the error is that on the second pass the 'dataset2' variable now includes the 'part-0.parquet' file (which can be confirmed by the `dataset2.files` output showing the file), and the CSV reader is attempting to parse the parquet file.
> If this is the case, is there an argument to ignore the parquet file and only evaluate the CSV files? Also, if a dataset object has a 'csv' or 'pyarrow._dataset.CsvFileFormat' format associated with it, it would be nice to evaluate only CSV files and not all file types in the path, if that is not the current behavior.
> If this is not the case, any ideas on the cause or solution?
> Any assistance would be greatly appreciated.
> Thank you and have a great day!
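One way to get the behavior asked for above, pending any dedicated argument, is to pass ds.dataset an explicit list of files instead of the directory, so a leftover part-0.parquet is never handed to the CSV reader. A minimal sketch (the directory layout below is fabricated for illustration; pyarrow's dataset() also documents exclude_invalid_files and ignore_prefixes keywords that may be relevant):

```python
# Sketch: build the file list explicitly so only '*.csv' files are read.
# The directory and file names below are fabricated for illustration.
import glob
import os
import tempfile

csv_path = tempfile.mkdtemp()
for name in ("2021-01-01.csv", "2021-01-02.csv", "part-0.parquet"):
    open(os.path.join(csv_path, name), "w").close()

# Select only the CSV files, skipping any parquet output from a prior run
csv_files = sorted(glob.glob(os.path.join(csv_path, "*.csv")))
selected = [os.path.basename(p) for p in csv_files]

# The explicit list (rather than the directory) can then be passed on, e.g.:
#   dataset2 = ds.dataset(csv_files, format=custom_csv_format)
```

ds.dataset accepts a list of file paths as its source, so the leftover parquet file simply never enters the dataset.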
[jira] [Commented] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source
[ https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553021#comment-17553021 ] Yibo Cai commented on ARROW-16810: -- I think pyarrow.dataset.dataset(dir, format='xxx') will read all files under the dir and try to parse them as format 'xxx'. cc [~jorisvandenbossche] for comments.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16813) [Go][Parquet] Disabling dictionary encoding per column in writer config broken
Matt DePero created ARROW-16813: --- Summary: [Go][Parquet] Disabling dictionary encoding per column in writer config broken Key: ARROW-16813 URL: https://issues.apache.org/jira/browse/ARROW-16813 Project: Apache Arrow Issue Type: Bug Components: Go Reporter: Matt DePero

Small bug in how we set the column-level dictionary encoding config: it is always set to true rather than respecting the input value.
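The bug described, a per-column setting that is hard-coded to true instead of taking the caller's value, can be illustrated with a minimal sketch (illustrative Python, not the actual Go parquet writer code; all names are hypothetical):

```python
# Minimal illustration of the reported bug pattern (hypothetical names,
# not the actual Go parquet writer code).
class WriterConfig:
    def __init__(self):
        self.dictionary_enabled = {}

    def set_dictionary_broken(self, column, enabled):
        # Bug: the 'enabled' argument is ignored and the flag is
        # unconditionally set to True.
        self.dictionary_enabled[column] = True

    def set_dictionary_fixed(self, column, enabled):
        # Fix: respect the caller's value.
        self.dictionary_enabled[column] = enabled

cfg = WriterConfig()
cfg.set_dictionary_broken("col_a", False)
broken_result = cfg.dictionary_enabled["col_a"]   # True, despite passing False
cfg.set_dictionary_fixed("col_a", False)
fixed_result = cfg.dictionary_enabled["col_a"]    # False, as requested
```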
[jira] [Updated] (ARROW-16813) [Go][Parquet] Disabling dictionary encoding per column in writer config broken
[ https://issues.apache.org/jira/browse/ARROW-16813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16813: --- Labels: pull-request-available (was: )

> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (ARROW-16811) [C++] Remove default exec context from Expression::Bind
[ https://issues.apache.org/jira/browse/ARROW-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552993#comment-17552993 ] Yaron Gvili commented on ARROW-16811: - You are referring to this post. This limited bind is not ideal, though it can be useful as an intermediate solution in places in the code that cannot easily be changed to work with a non-default ExecContext. I imagine this could be the case in some user-facing APIs that currently do not take an ExecContext and eventually default to the global function registry (perhaps examples exist in the dataset package?). In such cases, there are two options to consider: either break user code by forcing it to provide an ExecContext, or keep user code intact but fail at runtime when an expression gets bound in a non-safe way. The latter is what I wanted to draw attention to.

> [C++] Remove default exec context from Expression::Bind
> -------------------------------------------------------
>
> Key: ARROW-16811
> URL: https://issues.apache.org/jira/browse/ARROW-16811
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> This came up in https://github.com/apache/arrow/pull/13355.
> It is maybe not very intuitive that Expression::Bind would require an ExecContext, and so we never provided one. However, when binding expressions we need to look up kernels, and that requires a function registry. Defaulting to default_exec_context is something that should be done at a higher level, and so we should not allow ExecContext to be omitted when calling Bind.
> Furthermore, [~rtpsw] has suggested that we might want to split Expression::Bind into two variants: one which requires an ExecContext and one which does not (but fails if it encounters a "call").
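The split discussed above, one Bind that takes an explicit context and one that works without a context but rejects any function call, can be sketched as follows (illustrative Python, not the actual C++ Expression::Bind API; all names are hypothetical):

```python
# Hypothetical sketch of the two proposed Bind variants (the real API is
# the C++ Expression::Bind; these names are illustrative only).
class BindError(Exception):
    pass

class Expression:
    def __init__(self, kind, name=None):
        self.kind = kind          # 'field' or 'call'
        self.name = name

    def bind(self, exec_context):
        # Full variant: a context (carrying a function registry) is
        # required, so kernel lookup for calls is always possible.
        if self.kind == "call":
            exec_context["registry"][self.name]  # kernel lookup
        return ("bound", self.kind)

    def bind_no_context(self):
        # Restricted variant: no context is available, so any 'call'
        # must be rejected rather than silently using a global default.
        if self.kind == "call":
            raise BindError("cannot bind a call without an ExecContext")
        return ("bound", self.kind)

ctx = {"registry": {"add": object()}}
field_ok = Expression("field").bind_no_context()   # fine: no kernel needed
call_ok = Expression("call", "add").bind(ctx)      # fine: registry supplied
try:
    Expression("call", "add").bind_no_context()    # must fail
    call_rejected = False
except BindError:
    call_rejected = True
```

The restricted variant makes the failure explicit at bind time instead of quietly falling back to a global default context.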
[jira] [Created] (ARROW-16812) "Edit this page" on docstring generated docs gives 404
Saul Pwanson created ARROW-16812: --- Summary: "Edit this page" on docstring generated docs gives 404 Key: ARROW-16812 URL: https://issues.apache.org/jira/browse/ARROW-16812 Project: Apache Arrow Issue Type: Bug Components: Documentation Reporter: Saul Pwanson

Clicking on "Edit this page" on e.g. [https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_file.html] goes to a non-existent page.
[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build
[ https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552978#comment-17552978 ] Arkadiy Vertleyb commented on ARROW-16778: -- [~willjones127] Can you try the following under your 32-bit architecture (where the tests pass)?
1) Break in TEST_F(TestSetBitRunReader, OneByte).
2) Put breakpoints on:
- SkipNextZeroes
- CountNextOnes
3) See what is going on.
In my system:
current_word_ = 182 (1011 0110)
num_zeros = 0; current_word_ remains 182 -- no zeroes removed
Then the following asserts, because 182 & 1 == 0:
int64_t CountNextOnes() { assert(current_word_ & kFirstBit);
I have a feeling something may be wrong with the byte ordering, but I am not sure.

> [C++] 32 bit MSVC doesn't build
> -------------------------------
>
> Key: ARROW-16778
> URL: https://issues.apache.org/jira/browse/ARROW-16778
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Environment: Win32, MSVC
> Reporter: Arkadiy Vertleyb
> Priority: Major
>
> When specifying Win32 as the platform and building with MSVC, the build fails with the following compile errors:
> {noformat}
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error C3861: '__popcnt64': identifier not found [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error C3861: '_BitScanReverse64': identifier not found [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error C3861: '_BitScanForward64': identifier not found [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj]
> {noformat}
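The arithmetic in the comment can be checked directly: with current_word_ = 182 (0b10110110), the low bit is 0, so the CountNextOnes precondition (current_word_ & kFirstBit) must fail unless the trailing zero is skipped first. A quick Python check of the values from the comment (this mirrors the reported numbers, not the C++ implementation):

```python
# Values taken from the comment above: current_word_ = 182 = 0b10110110.
current_word = 0b10110110          # 182
k_first_bit = 1

# With num_zeros = 0 nothing is shifted off, so the low bit is still 0 and
# 'current_word & k_first_bit' is falsy -> the assert in CountNextOnes fires.
precondition_holds = bool(current_word & k_first_bit)

# Skipping the trailing zeroes first would shift the word right by the
# number of trailing zero bits (1 here), after which the low bit is set
# and the precondition holds.
trailing_zeros = (current_word & -current_word).bit_length() - 1
shifted = current_word >> trailing_zeros
```

This confirms the comment's reading: with num_zeros computed as 0, the assertion failure is exactly what the numbers predict.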
[jira] [Updated] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source
[ https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earle Lyons updated ARROW-16810: Summary: [Python] PyArrow: write_dataset - Could not open CSV input source (was: PyArrow: write_dataset - Could not open CSV input source)
[jira] [Updated] (ARROW-16810) PyArrow: write_dataset - Could not open CSV input source
[ https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earle Lyons updated ARROW-16810: Summary: PyArrow: write_dataset - Could not open CSV input source (was: [Python] PyArrow: write_dataset - Could not open CSV input source)
[jira] [Updated] (ARROW-16810) [Python] PyArrow: write_dataset - Could not open CSV input source
[ https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earle Lyons updated ARROW-16810: Summary: [Python] PyArrow: write_dataset - Could not open CSV input source (was: PyArrow: write_dataset - Could not open CSV input source)
[jira] [Commented] (ARROW-16811) [C++] Remove default exec context from Expression::Bind
[ https://issues.apache.org/jira/browse/ARROW-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552965#comment-17552965 ] Weston Pace commented on ARROW-16811: - [~rtpsw] Where would you see the "bind but don't support functions" variant being useful? I suppose I'm not quite sure I understand the intent.
[jira] [Assigned] (ARROW-16811) [C++] Remove default exec context from Expression::Bind
[ https://issues.apache.org/jira/browse/ARROW-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace reassigned ARROW-16811: --- Assignee: Weston Pace
[jira] [Updated] (ARROW-16811) [C++] Remove default exec context from Expression::Bind
[ https://issues.apache.org/jira/browse/ARROW-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16811: --- Labels: pull-request-available (was: ) > [C++] Remove default exec context from Expression::Bind > --- > > Key: ARROW-16811 > URL: https://issues.apache.org/jira/browse/ARROW-16811 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This came up in https://github.com/apache/arrow/pull/13355. > It is maybe not very intuitive that Expression::Bind would require an > ExecContext and so we never provided one. However, when binding expressions > we need to lookup kernels, and that requires a function registry. Defaulting > to default_exec_context is something that should be done at a higher level > and so we should not allow ExecContext to be omitted when calling Bind. > Furthermore, [~rtpsw] has suggested that we might want to split > Expression::Bind into two variants. One which requires an ExecContext and > one which does not (but fails if it encounters a "call"). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16810) PyArrow: write_dataset - Could not open CSV input source
Earle Lyons created ARROW-16810: --- Summary: PyArrow: write_dataset - Could not open CSV input source Key: ARROW-16810 URL: https://issues.apache.org/jira/browse/ARROW-16810 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 8.0.0 Environment: Pop!_OS 20.04 LTS OS & Conda 4.11.0 /Mamba 0.23.0 Environment Reporter: Earle Lyons Hi Arrow Community! Happy Friday! I am a new user to Arrow, specifically using pyarrow. However, I am very excited about the project. I am experiencing issues with the {*}write_dataset{*} function from the {*}dataset{*} module. Please forgive me if this is a known issue; however, I have searched the GitHub issues, as well as Stack Overflow, and have not identified a similar issue. I have a directory that contains 90 CSV files (essentially one CSV for each day between 2021-01-01 and 2021-03-31). My objective was to read all the CSV files into a dataset and write the dataset to a single Parquet file. Unfortunately, some of the CSV files contained nulls in some columns, which presented some issues; these were resolved by specifying DataTypes with the following Stack Overflow solution: [How do I specify a dtype for all columns when reading a CSV file with pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow] The following code works on the first pass.
{code:python}
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
import re
{code}

{code:python}
pa.__version__
'8.0.0'
{code}

{code:python}
column_types = {}
csv_path = '/home/user/csv_files'
field_re_pattern = "value_*"

# Open a dataset with the 'csv_path' path and 'csv' file format
# and assign to 'dataset1'
dataset1 = ds.dataset(csv_path, format='csv')

# Loop through each field in the 'dataset1' schema,
# match the 'field_re_pattern' regex pattern against the field name,
# and assign the 'int64' DataType to the field.name key in the
# 'column_types' dictionary
for field in (field for field in dataset1.schema \
              if re.match(field_re_pattern, field.name)):
    column_types[field.name] = pa.int64()

# Create options for CSV data using the 'column_types' dictionary
# This returns a ConvertOptions object
convert_options = csv.ConvertOptions(column_types=column_types)

# Create a FileFormat for CSV using the 'convert_options'
# This returns a CsvFileFormat object
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)

# Open a dataset with the 'csv_path' path; instead of the plain
# 'csv' file format, use the 'custom_csv_format' and assign to
# 'dataset2'
dataset2 = ds.dataset(csv_path, format=custom_csv_format)

# Write 'dataset2' to the 'csv_path' base directory in the
# 'parquet' format, overwriting/ignoring if files exist
ds.write_dataset(dataset2, base_dir=csv_path, format='parquet',
                 existing_data_behavior='overwrite_or_ignore')
{code}

As previously stated, on first pass, the code works and creates a single Parquet file (part-0.parquet) with the correct data, row count, and schema.
However, if the code is run again, the following error is encountered:

{code:python}
ArrowInvalid: Could not open CSV input source '/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2: Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V p A$A18CEBS 305DEM030TTW �5HZ50GCVJV1CSV
{code}

My interpretation of the error is that on the second pass the 'dataset2' variable now includes the 'part-0.parquet' file (which can be confirmed by the `dataset2.files` output showing the file) and the CSV reader is attempting to parse/read the Parquet file. If this is the case, is there an argument to ignore the Parquet file and only evaluate the CSV files? Also, if a dataset object has a format of 'csv' or 'pyarrow._dataset.CsvFileFormat' associated with it, it would be nice to evaluate only CSV files and not all file types in the path, if that is not already the current behavior. If this is not the case, any ideas on the cause or solution? Any assistance would be greatly appreciated. Thank you and have a great day! -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16811) [C++] Remove default exec context from Expression::Bind
Weston Pace created ARROW-16811: --- Summary: [C++] Remove default exec context from Expression::Bind Key: ARROW-16811 URL: https://issues.apache.org/jira/browse/ARROW-16811 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace This came up in https://github.com/apache/arrow/pull/13355. It is maybe not very intuitive that Expression::Bind would require an ExecContext and so we never provided one. However, when binding expressions we need to lookup kernels, and that requires a function registry. Defaulting to default_exec_context is something that should be done at a higher level and so we should not allow ExecContext to be omitted when calling Bind. Furthermore, [~rtpsw] has suggested that we might want to split Expression::Bind into two variants. One which requires an ExecContext and one which does not (but fails if it encounters a "call"). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument
[ https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-16796. - Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13355 [https://github.com/apache/arrow/pull/13355] > [C++] Fix bad defaulting of ExecContext argument > > > Key: ARROW-16796 > URL: https://issues.apache.org/jira/browse/ARROW-16796 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Yaron Gvili >Assignee: Yaron Gvili >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > In several places in Arrow code, invocations of Expression::Bind() default > the ExecContext argument. This leads to the default function registry being > used in expression manipulations, and this becomes a problem when the user > wishes to use a non-default function registry, e.g., when passing one to the > ExecContext of an ExecPlan, which is how I discovered this issue. The > problematic places I found for such Expression::Bind() invocation are: > * cpp/src/arrow/dataset/file_parquet.cc > * cpp/src/arrow/dataset/scanner.cc > * cpp/src/arrow/compute/exec/project_node.cc > * cpp/src/arrow/compute/exec/hash_join_node.cc > * cpp/src/arrow/compute/exec/filter_node.cc > There are also other places in test and benchmark code (grep for 'Bind()'). > Another case of bad defaulting of an ExecContext argument is in > Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh > ExecContext is created, instead of being received from the caller, and passed > to BindNonRecursive. > I'd argue that an ExecContext variable should not be allowed to default, > except perhaps in the highest-level/user-facing APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build
[ https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552931#comment-17552931 ] Arkadiy Vertleyb commented on ARROW-16778: -- Okay, I will see what could be wrong with it. > [C++] 32 bit MSVC doesn't build > --- > > Key: ARROW-16778 > URL: https://issues.apache.org/jira/browse/ARROW-16778 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Win32, MSVC >Reporter: Arkadiy Vertleyb >Priority: Major > > When specifying Win32 as a platform, and building with MSVC, the build fails > with the following compile errors : > {noformat} > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error > C3861: '__popcnt64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error > C3861: '_BitScanReverse64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error > C3861: '_BitScanForward64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build
[ https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552923#comment-17552923 ] Will Jones commented on ARROW-16778: [~avertleyb] We actually do test 32-bit on MinGW in CI on every PR, just not on MSVC. It's likely there's just something wrong with the bit utility still; validity bitmaps are a fundamental part of Arrow Arrays, so it wouldn't be surprising at all that a single small issue in bitmap handling would break most tests. > [C++] 32 bit MSVC doesn't build > --- > > Key: ARROW-16778 > URL: https://issues.apache.org/jira/browse/ARROW-16778 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Win32, MSVC >Reporter: Arkadiy Vertleyb >Priority: Major > > When specifying Win32 as a platform, and building with MSVC, the build fails > with the following compile errors : > {noformat} > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error > C3861: '__popcnt64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error > C3861: '_BitScanReverse64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error > C3861: '_BitScanForward64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16746) [C++][Python] S3 tag support on write
[ https://issues.apache.org/jira/browse/ARROW-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552922#comment-17552922 ] Steve Loughran commented on ARROW-16746: Yes, we use them a bit in the S3A committers, to annotate a zero-byte marker file with the length it will finally get when manifested at its destination. In HADOOP-17833 that's being exposed in the createFile(path) builder API, where apps can set headers at create time. Presumably GCS and Azure could be wired up differently; they both have the advantage that you can edit file attributes after creation. > [C++][Python] S3 tag support on write > - > > Key: ARROW-16746 > URL: https://issues.apache.org/jira/browse/ARROW-16746 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: André Kelpe >Priority: Major > > S3 allows tagging data to better organize one's data > ([https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html]) > We use this for efficient downstream processes/inventory management. > Currently arrow/pyarrow does not allow tags to be added on write. This is > causing us to scan the bucket and re-apply the tags after a pyarrow-based > process has run. > I looked through the code and think that it could potentially be done via the > metadata mechanism. > The tags need to be added to the CreateMultipartUploadRequest here: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L1156 > See also > http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_s3_1_1_model_1_1_create_multipart_upload_request.html#af791f34a65dc69bd681d6995313be2da -- This message was sent by Atlassian Jira (v8.20.7#820007)
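The re-apply step the reporter describes can be sketched with boto3's standard `put_object_tagging` call (a hedged illustration, not a pyarrow API; the bucket/key names and helper functions below are placeholders of my own):

```python
# Sketch of tagging S3 objects after a pyarrow write, since pyarrow's
# S3 filesystem does not currently expose tags on write. The boto3
# call is only defined here, not executed.

def tag_set(tags):
    """Convert a plain dict into the TagSet structure the S3 API expects."""
    return {'TagSet': [{'Key': k, 'Value': v} for k, v in sorted(tags.items())]}

def apply_tags(s3_client, bucket, key, tags):
    # s3_client is a boto3 S3 client, e.g. boto3.client('s3');
    # put_object_tagging replaces the full tag set on one object.
    s3_client.put_object_tagging(Bucket=bucket, Key=key, Tagging=tag_set(tags))

print(tag_set({'team': 'data', 'stage': 'raw'}))
```

Iterating `apply_tags` over the written files is the bucket-scan-and-retag pattern the reporter wants to retire in favor of tags set at upload time.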
[jira] [Updated] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups
[ https://issues.apache.org/jira/browse/ARROW-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edward Visel updated ARROW-16807: - Description: When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results:

{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)
ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 NA                 4

# repeated runs return differing, incorrect counts:
# 19, 17, 17, 16, 16, 17, 17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#> 1    19

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, results are correct.

I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

pa.compute.count_distinct(starwars.column('sex')).as_py()
#> 15

pa.compute.unique(starwars.column('sex'))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem as in this Stack Overflow question, which is working from ORC files: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array

> [C++] count_distinct aggregates incorrectly across row groups > - > > Key: ARROW-16807 > URL: https://issues.apache.org/jira/browse/ARROW-16807 > Project: Apache Arrow > Issue Type: Bug > Environment: >
[jira] [Comment Edited] (ARROW-16778) [C++] 32 bit MSVC doesn't build
[ https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552918#comment-17552918 ] Arkadiy Vertleyb edited comment on ARROW-16778 at 6/10/22 6:41 PM: --- [~willjones127] Depends. Besides crash, multiple tests fail. Let me ask you - in your best estimate, when was the last time someone ran 32 bit tests? It could be badly broken by now... I am afraid fixing that might be a major task, even for someone closely familiar with the system, let alone a new library user like myself. For your reference: Start 1: arrow-array-test 1/26 Test #1: arrow-array-test .***Failed0.24 sec Start 2: arrow-buffer-test 2/26 Test #2: arrow-buffer-test Passed0.03 sec Start 3: arrow-extension-type-test 3/26 Test #3: arrow-extension-type-test ***Failed0.02 sec Start 4: arrow-misc-test 4/26 Test #4: arrow-misc-test .. Passed0.05 sec Start 5: arrow-public-api-test 5/26 Test #5: arrow-public-api-test Passed0.02 sec Start 6: arrow-scalar-test 6/26 Test #6: arrow-scalar-test ***Failed0.06 sec Start 7: arrow-type-test 7/26 Test #7: arrow-type-test .. Passed0.15 sec Start 8: arrow-table-test 8/26 Test #8: arrow-table-test .***Failed0.05 sec Start 9: arrow-tensor-test 9/26 Test #9: arrow-tensor-test Passed0.02 sec Start 10: arrow-sparse-tensor-test 10/26 Test #10: arrow-sparse-tensor-test . Passed0.07 sec Start 11: arrow-stl-test 11/26 Test #11: arrow-stl-test ... Passed0.03 sec Start 12: arrow-random-test 12/26 Test #12: arrow-random-test Passed0.20 sec Start 13: arrow-json-integration-test 13/26 Test #13: arrow-json-integration-test ..***Failed0.15 sec Start 14: arrow-concatenate-test 14/26 Test #14: arrow-concatenate-test ...***Failed0.02 sec Start 15: arrow-c-bridge-test 15/26 Test #15: arrow-c-bridge-test ..***Failed0.06 sec Start 16: arrow-io-buffered-test 16/26 Test #16: arrow-io-buffered-test ... Passed0.08 sec Start 17: arrow-io-compressed-test 17/26 Test #17: arrow-io-compressed-test . 
Passed 0.02 sec Start 18: arrow-io-file-test 18/26 Test #18: arrow-io-file-test ...***Failed 10.61 sec Start 19: arrow-io-memory-test 19/26 Test #19: arrow-io-memory-test .***Failed 1.62 sec Start 20: arrow-utility-test 20/26 Test #20: arrow-utility-test ...***Failed 3.00 sec Start 21: arrow-threading-utility-test 21/26 Test #21: arrow-threading-utility-test . Passed 39.77 sec Start 22: arrow-feather-test 22/26 Test #22: arrow-feather-test ...***Failed 0.04 sec Start 23: arrow-ipc-json-simple-test 23/26 Test #23: arrow-ipc-json-simple-test ...***Failed 0.06 sec Start 24: arrow-ipc-read-write-test 24/26 Test #24: arrow-ipc-read-write-test ***Failed 8.17 sec Start 25: arrow-ipc-tensor-test 25/26 Test #25: arrow-ipc-tensor-test ***Failed 1.16 sec Start 26: arrow-json-test 26/26 Test #26: arrow-json-test ..***Failed 0.03 sec 42% tests passed, 15 tests failed out of 26
[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build
[ https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552918#comment-17552918 ] Arkadiy Vertleyb commented on ARROW-16778: -- [~willjones127] Depends. Besides crash, multiple tests fail. Let me ask you - in your best estimate, when was the last time someone ran 32 bit tests? It could be badly broken by now... I am afraid fixing that might be a major task, even for someone closely familiar with the system, let alone a library user like myself. For your reference: Start 1: arrow-array-test 1/26 Test #1: arrow-array-test .***Failed0.24 sec Start 2: arrow-buffer-test 2/26 Test #2: arrow-buffer-test Passed0.03 sec Start 3: arrow-extension-type-test 3/26 Test #3: arrow-extension-type-test ***Failed0.02 sec Start 4: arrow-misc-test 4/26 Test #4: arrow-misc-test .. Passed0.05 sec Start 5: arrow-public-api-test 5/26 Test #5: arrow-public-api-test Passed0.02 sec Start 6: arrow-scalar-test 6/26 Test #6: arrow-scalar-test ***Failed0.06 sec Start 7: arrow-type-test 7/26 Test #7: arrow-type-test .. Passed0.15 sec Start 8: arrow-table-test 8/26 Test #8: arrow-table-test .***Failed0.05 sec Start 9: arrow-tensor-test 9/26 Test #9: arrow-tensor-test Passed0.02 sec Start 10: arrow-sparse-tensor-test 10/26 Test #10: arrow-sparse-tensor-test . Passed0.07 sec Start 11: arrow-stl-test 11/26 Test #11: arrow-stl-test ... Passed0.03 sec Start 12: arrow-random-test 12/26 Test #12: arrow-random-test Passed0.20 sec Start 13: arrow-json-integration-test 13/26 Test #13: arrow-json-integration-test ..***Failed0.15 sec Start 14: arrow-concatenate-test 14/26 Test #14: arrow-concatenate-test ...***Failed0.02 sec Start 15: arrow-c-bridge-test 15/26 Test #15: arrow-c-bridge-test ..***Failed0.06 sec Start 16: arrow-io-buffered-test 16/26 Test #16: arrow-io-buffered-test ... Passed0.08 sec Start 17: arrow-io-compressed-test 17/26 Test #17: arrow-io-compressed-test . 
Passed0.02 sec Start 18: arrow-io-file-test 18/26 Test #18: arrow-io-file-test ...***Failed 10.61 sec Start 19: arrow-io-memory-test 19/26 Test #19: arrow-io-memory-test .***Failed1.62 sec Start 20: arrow-utility-test 20/26 Test #20: arrow-utility-test ...***Failed3.00 sec Start 21: arrow-threading-utility-test 21/26 Test #21: arrow-threading-utility-test . Passed 39.77 sec Start 22: arrow-feather-test 22/26 Test #22: arrow-feather-test ...***Failed0.04 sec Start 23: arrow-ipc-json-simple-test 23/26 Test #23: arrow-ipc-json-simple-test ...***Failed0.06 sec Start 24: arrow-ipc-read-write-test 24/26 Test #24: arrow-ipc-read-write-test ***Failed8.17 sec Start 25: arrow-ipc-tensor-test 25/26 Test #25: arrow-ipc-tensor-test ***Failed1.16 sec Start 26: arrow-json-test 26/26 Test #26: arrow-json-test ..***Failed0.03 sec 42% tests passed, 15 tests failed out of 26 > [C++] 32 bit MSVC doesn't build > --- > > Key: ARROW-16778 > URL: https://issues.apache.org/jira/browse/ARROW-16778 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Win32, MSVC >Reporter: Arkadiy Vertleyb >Priority: Major > > When specifying Win32 as a platform, and building with MSVC, the build fails > with the following compile errors : > {noformat} > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error > C3861: '__popcnt64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error > C3861: '_BitScanReverse64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error > C3861: '_BitScanForward64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16779) Request for Pyarrow Flight to be shipped in arm64 MacOS version of the wheel
[ https://issues.apache.org/jira/browse/ARROW-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552903#comment-17552903 ] Ajay Kanagala commented on ARROW-16779: --- Hi Team, We are currently blocked on this feature request. Can you please help with the timeline for the new feature request, and also who do we reach out to to get this feature request prioritized? > Request for Pyarrow Flight to be shipped in arm64 MacOS version of the wheel > > > Key: ARROW-16779 > URL: https://issues.apache.org/jira/browse/ARROW-16779 > Project: Apache Arrow > Issue Type: New Feature > Components: FlightRPC > Environment: Mac M1 OS, Python, >Reporter: Ajay Kanagala >Priority: Major > Labels: features > > This ticket is a continuation of the previous ticket > "https://issues.apache.org/jira/browse/ARROW-13657". > It was found that Flight is not shipped in all versions of the wheel. You will > also get an import error if you attempt to import pyarrow.gandiva, which is > also an optional feature. It is turned off for arm64 MacOS here: > [https://github.com/apache/arrow/blob/8f0ddc785dd72e950b570f3bc380deb15c124c45/dev/tasks/python-wheels/github.osx.arm64.yml#L26] > > Our team uses the Mac M1 processor to work on the Dremio driver and needs access to > the pyarrow package. > > Can you please add it to the wheel? -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-13657) [Python] No module named 'pyarrow._flight' (MacOS)
[ https://issues.apache.org/jira/browse/ARROW-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552902#comment-17552902 ] Ajay Kanagala commented on ARROW-13657: --- Hi [~willjones127] , Thank you very much for your response. It is really helpful. I created a new ticket (ARROW-16779) in regards for Pyarrow Flight to be shipped in arm64 MacOS version of the wheel. What is the timeline for the new feature request and who do we reach out to , to get this feature request prioritized? > [Python] No module named 'pyarrow._flight' (MacOS) > -- > > Key: ARROW-13657 > URL: https://issues.apache.org/jira/browse/ARROW-13657 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 5.0.0 > Environment: Device: Macbook Air M1 2020 (MacOS Big Sur 11.5.1) > Python version: Python 3.9.4 > Arrow version: 5.0.0 >Reporter: Dinh Long Nguyen >Priority: Major > > ModuleNotFoundError: No module named 'pyarrow._flight' > *Error Detail:* > Traceback (most recent call last): > File "arrowtest1/backend/server.py", line 4, in > import pyarrow.flight as fl > File > ".local/share/virtualenvs/backend-OiVOEXti/lib/python3.9/site-packages/pyarrow/flight.py", > line 18, in > from pyarrow._flight import ( # noqa:F401 > ModuleNotFoundError: No module named 'pyarrow._flight' > *Device*: Macbook Air M1 2020 (MacOS Big Sur 11.5.1) > *Python version:* Python 3.9.4 > *Arrow version:* 5.0.0 > *Description* > Pyarrow works fine, can import other component such as pyarrow.orc, but not > pyarrow.flight > Tried out several machines (intel 4790k, ubuntu), can import pyarrow.flight > no problem. > Even tried out VSCode Dev Container on the same macbook, can also import > flight no problem. > But I cant import when used pipenv directly within macos (I mean no > container). > *Replication process:* > pipenv install pyarrow > python > >> import pyarrow.flight > Then the error occured > -- This message was sent by Atlassian Jira (v8.20.7#820007)
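Until wheels for the affected platform ship the module, code that may run there can probe for it defensively (a generic Python guarded-import pattern, not a pyarrow-specific API):

```python
# pyarrow.flight is an optional component and is missing from some
# wheels (e.g. arm64 macOS at the time of this report), so guard the
# import instead of assuming it is present.
try:
    import pyarrow.flight as flight
    HAVE_FLIGHT = True
except ImportError:
    flight = None
    HAVE_FLIGHT = False

print('pyarrow.flight available:', HAVE_FLIGHT)
```

Callers can then branch on `HAVE_FLIGHT` and fail with a clear message instead of an import-time traceback.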
[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build
[ https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552899#comment-17552899 ] Will Jones commented on ARROW-16778: {quote}This builds with three path parameters removed to use defaults and with my patch applied. {quote} Good! {quote}But then ctest crashes running arrow-io-file-test. {quote} Yup, same here. Would you like to continue debugging yourself? Or else I can look into it soon. > [C++] 32 bit MSVC doesn't build > --- > > Key: ARROW-16778 > URL: https://issues.apache.org/jira/browse/ARROW-16778 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Win32, MSVC >Reporter: Arkadiy Vertleyb >Priority: Major > > When specifying Win32 as a platform, and building with MSVC, the build fails > with the following compile errors : > {noformat} > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error > C3861: '__popcnt64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error > C3861: '_BitScanReverse64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error > C3861: '_BitScanForward64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (ARROW-16778) [C++] 32 bit MSVC doesn't build
[ https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552895#comment-17552895 ] Arkadiy Vertleyb edited comment on ARROW-16778 at 6/10/22 6:02 PM: --- [~willjones127] This builds with three path parameters removed to use defaults and with my patch applied. But then ctest crashes running arrow-io-file-test. was (Author: JIRAUSER290619): [~willjones127] This builds with three path parameters removed to use defaults and with my patch applied. But ctest crashes running arrow-io-file-test. > [C++] 32 bit MSVC doesn't build > --- > > Key: ARROW-16778 > URL: https://issues.apache.org/jira/browse/ARROW-16778 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Win32, MSVC >Reporter: Arkadiy Vertleyb >Priority: Major > > When specifying Win32 as a platform, and building with MSVC, the build fails > with the following compile errors : > {noformat} > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error > C3861: '__popcnt64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error > C3861: '_BitScanReverse64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error > C3861: '_BitScanForward64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16749) [Go] Bug when converting from Arrow to Parquet from null array
[ https://issues.apache.org/jira/browse/ARROW-16749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-16749. --- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13310 [https://github.com/apache/arrow/pull/13310] > [Go] Bug when converting from Arrow to Parquet from null array > -- > > Key: ARROW-16749 > URL: https://issues.apache.org/jira/browse/ARROW-16749 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Affects Versions: 9.0.0 >Reporter: Alexandre Crayssac >Assignee: Alexandre Crayssac >Priority: Minor > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Hello world, > When converting from Arrow to Parquet it looks like there is a bug with > arrays of type {{{}arrow.NULL{}}}. Here is a snippet of code to reproduce the > bug: > > {code:java} > package main > import ( > "fmt" > "log" > "os" > "github.com/apache/arrow/go/v9/arrow" > "github.com/apache/arrow/go/v9/arrow/array" > "github.com/apache/arrow/go/v9/arrow/memory" > "github.com/apache/arrow/go/v9/parquet/pqarrow" > ) > const n = 10 > func run() error { > schema := arrow.NewSchema( > []arrow.Field{ > {Name: "f1", Type: arrow.Null, Nullable: true}, > }, > nil, > ) > rb := array.NewRecordBuilder(memory.DefaultAllocator, schema) > defer rb.Release() > for i := 0; i < n; i++ { > rb.Field(0).(*array.NullBuilder).AppendNull() > } > rec := rb.NewRecord() > defer rec.Release() > for i, col := range rec.Columns() { > fmt.Printf("column[%d] %q: %v\n", i, rec.ColumnName(i), col) > } > f, err := os.Create("output.parquet") > if err != nil { > return err > } > defer f.Close() > w, err := pqarrow.NewFileWriter(rec.Schema(), f, nil, > pqarrow.DefaultWriterProps()) > if err != nil { > return err > } > defer w.Close() > err = w.Write(rec) > if err != nil { > return err > } > return nil > } > func main() { > if err := run(); err != nil { > log.Fatal(err) > } > } {code} -- This message was sent by 
Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build
[ https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552895#comment-17552895 ] Arkadiy Vertleyb commented on ARROW-16778: -- [~willjones127] This builds with three path parameters removed to use defaults and with my patch applied. But ctest crashes running arrow-io-file-test. > [C++] 32 bit MSVC doesn't build > --- > > Key: ARROW-16778 > URL: https://issues.apache.org/jira/browse/ARROW-16778 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Win32, MSVC >Reporter: Arkadiy Vertleyb >Priority: Major > > When specifying Win32 as a platform, and building with MSVC, the build fails > with the following compile errors : > {noformat} > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error > C3861: '__popcnt64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error > C3861: '_BitScanReverse64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error > C3861: '_BitScanForward64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007)
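The intrinsics named in the errors above (`__popcnt64`, `_BitScanReverse64`, `_BitScanForward64`) are x64-only in MSVC; a Win32 build has only the 32-bit forms, so a fallback has to compose two 32-bit operations. A minimal sketch of that idea in Python (illustrative only, not the Arrow C++ code):

```python
def popcount32(x: int) -> int:
    # Reference 32-bit popcount, standing in for MSVC's __popcnt.
    return bin(x & 0xFFFFFFFF).count("1")

def popcount64(x: int) -> int:
    # On 32-bit MSVC there is no __popcnt64, so a portable fallback
    # popcounts the low and high 32-bit halves and adds the results.
    return popcount32(x & 0xFFFFFFFF) + popcount32(x >> 32)

print(popcount64(0xFFFF0000FFFF0000))  # 32 bits set
```

The same halving trick applies to the bit-scan intrinsics: scan the high half first, and fall back to the low half (offsetting the index by 32) only when the high half is zero.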
[jira] [Updated] (ARROW-16809) [C++][Benchmarks] Create Filter Benchmark for Acero
[ https://issues.apache.org/jira/browse/ARROW-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16809: --- Labels: pull-request-available (was: ) > [C++][Benchmarks] Create Filter Benchmark for Acero > --- > > Key: ARROW-16809 > URL: https://issues.apache.org/jira/browse/ARROW-16809 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Ivan Chau >Assignee: Ivan Chau >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16809) [C++][Benchmarks] Create Filter Benchmark for Acero
[ https://issues.apache.org/jira/browse/ARROW-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Chau updated ARROW-16809: -- Summary: [C++][Benchmarks] Create Filter Benchmark for Acero (was: [Benchmarks] Create Filter Benchmark for Acero) > [C++][Benchmarks] Create Filter Benchmark for Acero > --- > > Key: ARROW-16809 > URL: https://issues.apache.org/jira/browse/ARROW-16809 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Ivan Chau >Assignee: Ivan Chau >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16809) [Benchmarks] Create Filter Benchmark for Acero
Ivan Chau created ARROW-16809: - Summary: [Benchmarks] Create Filter Benchmark for Acero Key: ARROW-16809 URL: https://issues.apache.org/jira/browse/ARROW-16809 Project: Apache Arrow Issue Type: Improvement Components: Benchmarking Reporter: Ivan Chau Assignee: Ivan Chau -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16716) [Benchmarks] Create Projection benchmark for Acero
[ https://issues.apache.org/jira/browse/ARROW-16716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-16716. - Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13314 [https://github.com/apache/arrow/pull/13314] > [Benchmarks] Create Projection benchmark for Acero > -- > > Key: ARROW-16716 > URL: https://issues.apache.org/jira/browse/ARROW-16716 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Li Jin >Assignee: Ivan Chau >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Attachments: out, out_expression > > Time Spent: 5h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups
[ https://issues.apache.org/jira/browse/ARROW-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edward Visel updated ARROW-16807: - Description: When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results: {code:r} library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> #> 1 5 {code} If the file is stored as a single row group, results are correct. When grouped, results are correct. 
I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}: {code:python} import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') print(pa.compute.count_distinct(starwars.column('sex')).as_py()) #> 15 print(pa.compute.unique(starwars.column('sex'))) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #>null #> ] {code} This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files. was: When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results: {code:r} library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> #> 1 5 {code} If the file is stored as a single row group, results are correct. 
When grouped, results are correct. I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}: {code:python} import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') print(pa.compute.count_distinct(starwars.column('sex')).as_py()) #> 15 print(pa.compute.unique(starwars.column('sex'))) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #>null #> ] {code} This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files. > [C++] count_distinct aggregates incorrectly across row groups > - > > Key: ARROW-16807 > URL: https://issues.apache.org/jira/browse/ARROW-16807 > Project: Apache Arrow > Issue Type: Bug >
[jira] [Created] (ARROW-16808) [C++] count_distinct aggregates incorrectly across row groups
Edward Visel created ARROW-16808: Summary: [C++] count_distinct aggregates incorrectly across row groups Key: ARROW-16808 URL: https://issues.apache.org/jira/browse/ARROW-16808 Project: Apache Arrow Issue Type: Bug Environment: > arrow::arrow_info() Arrow package version: 8.0.0.9000 Capabilities: datasetTRUE substrait FALSE parquetTRUE json TRUE s3 TRUE utf8proc TRUE re2TRUE snappy TRUE gzip TRUE brotli TRUE zstd TRUE lz4TRUE lz4_frame TRUE lzo FALSE bz2TRUE jemalloc TRUE mimalloc FALSE Memory: Allocator jemalloc Current37.25 Kb Max 925.42 Kb Runtime: SIMD Level none Detected SIMD Level none Build: C++ Library Version9.0.0-SNAPSHOT C++ Compiler AppleClang C++ Compiler Version 13.1.6.13160021 Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9 Reporter: Edward Visel Fix For: 9.0.0, 8.0.1 When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results: {code:r} library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> #> 1 5 {code} If the 
file is stored as a single row group, results are correct. When grouped, results are correct. I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}: {code:python} import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') print(pa.compute.count_distinct(starwars.column('sex')).as_py()) #> 15 print(pa.compute.unique(starwars.column('sex'))) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #>null #> ] {code} This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Closed] (ARROW-16808) [C++] count_distinct aggregates incorrectly across row groups
[ https://issues.apache.org/jira/browse/ARROW-16808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edward Visel closed ARROW-16808. Resolution: Duplicate Duplicate of [ARROW-16807] > [C++] count_distinct aggregates incorrectly across row groups > - > > Key: ARROW-16808 > URL: https://issues.apache.org/jira/browse/ARROW-16808 > Project: Apache Arrow > Issue Type: Bug > Environment: > arrow::arrow_info() > Arrow package version: 8.0.0.9000 > Capabilities: > > datasetTRUE > substrait FALSE > parquetTRUE > json TRUE > s3 TRUE > utf8proc TRUE > re2TRUE > snappy TRUE > gzip TRUE > brotli TRUE > zstd TRUE > lz4TRUE > lz4_frame TRUE > lzo FALSE > bz2TRUE > jemalloc TRUE > mimalloc FALSE > Memory: > > Allocator jemalloc > Current37.25 Kb > Max 925.42 Kb > Runtime: > > SIMD Level none > Detected SIMD Level none > Build: > > C++ Library Version9.0.0-SNAPSHOT > C++ Compiler AppleClang > C++ Compiler Version 13.1.6.13160021 > Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9 >Reporter: Edward Visel >Priority: Blocker > Fix For: 9.0.0, 8.0.1 > > > When reading from parquet files with multiple row groups, {{count_distinct}} > (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results: > {code:r} > library(dplyr, warn.conflicts = FALSE) > path <- tempfile(fileext = '.parquet') > arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) > ds <- arrow::open_dataset(path) > ds %>% count(sex) %>% collect() > #> # A tibble: 5 × 2 > #> sex n > #> > #> 1 male 60 > #> 2 none 6 > #> 3 female 16 > #> 4 hermaphroditic 1 > #> 5 4 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 19 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 17 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 17 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 16 > ds %>% summarise(n = 
n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 16 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 17 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 17 > # correct > ds %>% collect() %>% summarise(n = n_distinct(sex)) > #> # A tibble: 1 × 1 > #> n > #> > #> 1 5 > {code} > If the file is stored as a single row group, results are correct. When > grouped, results are correct. > I can reproduce this in Python as well using the same file and > {{pyarrow.compute.count_distinct}}: > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > pa.__version__ > #> 8.0.0 > starwars = > pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') > print(pa.compute.count_distinct(starwars.column('sex')).as_py()) > #> 15 > print(pa.compute.unique(starwars.column('sex'))) > #> [ > #> "male", > #> "none", > #> "female", > #> "hermaphroditic", > #>null > #> ] > {code} > This seems likely to be the same problem in this StackOverflow question: > https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array > which is working from orc files. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups
Edward Visel created ARROW-16807: Summary: [C++] count_distinct aggregates incorrectly across row groups Key: ARROW-16807 URL: https://issues.apache.org/jira/browse/ARROW-16807 Project: Apache Arrow Issue Type: Bug Environment: > arrow::arrow_info() Arrow package version: 8.0.0.9000 Capabilities: datasetTRUE substrait FALSE parquetTRUE json TRUE s3 TRUE utf8proc TRUE re2TRUE snappy TRUE gzip TRUE brotli TRUE zstd TRUE lz4TRUE lz4_frame TRUE lzo FALSE bz2TRUE jemalloc TRUE mimalloc FALSE Memory: Allocator jemalloc Current37.25 Kb Max 925.42 Kb Runtime: SIMD Level none Detected SIMD Level none Build: C++ Library Version9.0.0-SNAPSHOT C++ Compiler AppleClang C++ Compiler Version 13.1.6.13160021 Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9 Reporter: Edward Visel Fix For: 9.0.0, 8.0.1 When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results: {code:r} library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> #> 1 5 {code} If the 
file is stored as a single row group, results are correct. When grouped, results are correct. I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}: {code:python} import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') print(pa.compute.count_distinct(starwars.column('sex')).as_py()) #> 15 print(pa.compute.unique(starwars.column('sex'))) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #>null #> ] {code} This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files. -- This message was sent by Atlassian Jira (v8.20.7#820007)
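The symptom reported above — totals like 16–19 that drift between runs, versus the true 5 — is consistent with per-row-group distinct counts being combined by addition rather than by merging value sets. A minimal pure-Python sketch of the distinction, using made-up chunks rather than Arrow's actual kernel code:

```python
# Hypothetical per-row-group chunks of the "sex" column.
chunks = [
    ["male", "none", "male", None],
    ["female", "male", None, "female"],
    ["hermaphroditic", "male", "none", None],
]

# Wrong: summing per-chunk distinct counts double-counts any value
# that appears in more than one row group.
wrong = sum(len(set(chunk)) for chunk in chunks)

# Right: merge the value sets across all chunks, then count once.
right = len(set().union(*map(set, chunks)))

print(wrong, right)  # 10 5
```

Only the merged count matches the five distinct values (including NA) in the grouped/collected results shown above.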
[jira] [Updated] (ARROW-16756) [C++] Introduce initial ArraySpan, ExecSpan non-owning / shared_ptr-free data structures for kernel execution, refactor scalar kernels
[ https://issues.apache.org/jira/browse/ARROW-16756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16756: --- Labels: pull-request-available (was: ) > [C++] Introduce initial ArraySpan, ExecSpan non-owning / shared_ptr-free data > structures for kernel execution, refactor scalar kernels > -- > > Key: ARROW-16756 > URL: https://issues.apache.org/jira/browse/ARROW-16756 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This is essential to reduce microperformance overhead as has been discussed > and investigated many other places. This first stage of work is to remove the > use of {{Datum}} and {{ExecBatch}} from the input side of only scalar > kernels, so that we can work toward using span/view data structures as the > inputs (and eventually outputs) of all kernels. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build
[ https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552863#comment-17552863 ] Will Jones commented on ARROW-16778: Wow that will save me a lot of time! :) > [C++] 32 bit MSVC doesn't build > --- > > Key: ARROW-16778 > URL: https://issues.apache.org/jira/browse/ARROW-16778 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Win32, MSVC >Reporter: Arkadiy Vertleyb >Priority: Major > > When specifying Win32 as a platform, and building with MSVC, the build fails > with the following compile errors : > {noformat} > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error > C3861: '__popcnt64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error > C3861: '_BitScanReverse64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error > C3861: '_BitScanForward64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build
[ https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552859#comment-17552859 ] Antoine Pitrou commented on ARROW-16778: Instead of disabling warnings one-by-one, you can change the {{BUILD_WARNING_LEVEL}} CMake variable. I don't know why we are not documenting it ([~kou] do you know?), but it is described thusly in {{cmake_modules/SetupCxxFlags.cmake}}: {code} # BUILD_WARNING_LEVEL add warning/error compiler flags. The possible values are # - PRODUCTION: Build with `-Wall` but do not add `-Werror`, so warnings do not # halt the build. # - CHECKIN: Build with `-Wall` and `-Wextra`. Also, add `-Werror` in debug mode # so that any important warnings fail the build. # - EVERYTHING: Like `CHECKIN`, but possible extra flags depending on the # compiler, including `-Wextra`, `-Weverything`, `-pedantic`. # This is the most aggressive warning level. # Defaults BUILD_WARNING_LEVEL to `CHECKIN`, unless CMAKE_BUILD_TYPE is # `RELEASE`, then it will default to `PRODUCTION`. The goal of defaulting to # `CHECKIN` is to avoid friction with long response time from CI. 
{code} > [C++] 32 bit MSVC doesn't build > --- > > Key: ARROW-16778 > URL: https://issues.apache.org/jira/browse/ARROW-16778 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Win32, MSVC >Reporter: Arkadiy Vertleyb >Priority: Major > > When specifying Win32 as a platform, and building with MSVC, the build fails > with the following compile errors : > {noformat} > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error > C3861: '__popcnt64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error > C3861: '_BitScanReverse64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error > C3861: '_BitScanForward64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16746) [C++][Python] S3 tag support on write
[ https://issues.apache.org/jira/browse/ARROW-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552857#comment-17552857 ] Antoine Pitrou commented on ARROW-16746: [~ste...@apache.org] Thanks for the information. What are "user attributes" in this context? Are you talking about "User-defined object metadata" as defined in [https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html] ? > [C++][Python] S3 tag support on write > - > > Key: ARROW-16746 > URL: https://issues.apache.org/jira/browse/ARROW-16746 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: André Kelpe >Priority: Major > > S3 allows tagging data to better organize one's data > ([https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html]) > We use this for efficient downstream processes/inventory management. > Currently arrow/pyarrow does not allow tags to be added on write. This is > causing us to scan the bucket and re-apply the tags after a pyarrow based > process has run. > I looked through the code and think that it could potentially be done via the > metadata mechanism. > The tags need to be added to the CreateMultipartUploadRequest here: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L1156 > See also > http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_s3_1_1_model_1_1_create_multipart_upload_request.html#af791f34a65dc69bd681d6995313be2da -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16778) [C++] 32 bit MSVC doesn't build
[ https://issues.apache.org/jira/browse/ARROW-16778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552853#comment-17552853 ] Will Jones commented on ARROW-16778: [~avertleyb] I've been able to get both of the following configurations to compile. Do you want to try them out and lmk if they work for you? {code:none} @REM Build using Ninja cmake.EXE ^ -DARROW_DEPENDENCY_SOURCE=BUNDLED ^ -DCMAKE_BUILD_TYPE=Debug ^ -DARROW_BUILD_TESTS=ON ^ "-DARROW_CXXFLAGS=/wd4244 /wd4554 /wd4018" ^ -DARROW_BUILD_INTEGRATION=OFF ^ -DARROW_EXTRA_ERROR_CONTEXT=ON ^ -DARROW_BUILD_STATIC=OFF ^ -DARROW_WITH_RE2=OFF ^ -DARROW_WITH_UTF8PROC=OFF ^ -DCMAKE_INSTALL_PREFIX=c:/Users/voltron/arrow/cpp/build/user-cpp-debug-win32-alt/dist ^ -S%USERPROFILE%/arrow/cpp ^ -B%USERPROFILE%/arrow/cpp/build/user-cpp-debug-win32 ^ -G Ninja @REM Or build using Visual Studio cmake.EXE ^ -DARROW_DEPENDENCY_SOURCE=BUNDLED ^ -DCMAKE_BUILD_TYPE=Debug ^ -DARROW_BUILD_TESTS=ON ^ "-DARROW_CXXFLAGS=/wd4244 /wd4554 /wd4018" ^ -DARROW_BUILD_INTEGRATION=OFF ^ -DARROW_EXTRA_ERROR_CONTEXT=ON ^ -DARROW_BUILD_STATIC=OFF ^ -DARROW_WITH_RE2=OFF ^ -DARROW_WITH_UTF8PROC=OFF ^ -DCMAKE_INSTALL_PREFIX=c:/Users/voltron/arrow/cpp/build/user-cpp-debug-win32-alt/dist ^ -S%USERPROFILE%/arrow/cpp ^ -B%USERPROFILE%/arrow/cpp/build/user-cpp-debug-win32-alt ^ -G "Visual Studio 16 2019" ^ -A Win32 {code} > [C++] 32 bit MSVC doesn't build > --- > > Key: ARROW-16778 > URL: https://issues.apache.org/jira/browse/ARROW-16778 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Win32, MSVC >Reporter: Arkadiy Vertleyb >Priority: Major > > When specifying Win32 as a platform, and building with MSVC, the build fails > with the following compile errors : > {noformat} > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(70,59): error > C3861: '__popcnt64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > 
C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(204,7): error > C3861: '_BitScanReverse64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > C:\Users\avertleyb\git\arrow\cpp\src\arrow/util/bit_util.h(250,7): error > C3861: '_BitScanForward64': identifier not found > [C:\Users\avertleyb\git\arrow\cpp\build32\src\arrow\arrow_shared.vcxproj] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (ARROW-16716) [Benchmarks] Create Projection benchmark for Acero
[ https://issues.apache.org/jira/browse/ARROW-16716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Chau reassigned ARROW-16716: - Assignee: Ivan Chau > [Benchmarks] Create Projection benchmark for Acero > -- > > Key: ARROW-16716 > URL: https://issues.apache.org/jira/browse/ARROW-16716 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Li Jin >Assignee: Ivan Chau >Priority: Major > Labels: pull-request-available > Attachments: out, out_expression > > Time Spent: 5h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542244#comment-17542244 ] Antoine Pitrou edited comment on ARROW-15678 at 6/10/22 3:04 PM: - On further investigation, we can include immintrin.h with or without -mavx2 and clang at least will not complain unless the intrinsics are referenced, so {code} #include <immintrin.h> [[gnu::target("avx2")]] void use_simd() { __m256i arg; _mm256_abs_epi16 (arg); } int main() { use_simd(); } {code} compiles and runs happily without any special compilation flags. Using an attribute like this seems viable provided we can be certain that the modified target isn't transitively applied to functions which might be invoked for the first time inside a SIMD enabled function was (Author: bkietz): On further investigation, we can include immintrin.h with or without -mavx2 and clang at least will not complain unless the intrinsics are referenced, so {{code}} #include <immintrin.h> [[gnu::target("avx2")]] void use_simd() { __m256i arg; _mm256_abs_epi16 (arg); } int main() { use_simd(); } {{code}} compiles and runs happily without any special compilation flags. Using an attribute like this seems viable provided we can be certain that the modified target isn't transitively applied to functions which might be invoked for the first time inside a SIMD enabled function > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16759) [Go]
[ https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-16759. --- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13322 [https://github.com/apache/arrow/pull/13322] > [Go] > > > Key: ARROW-16759 > URL: https://issues.apache.org/jira/browse/ARROW-16759 > Project: Apache Arrow > Issue Type: Task > Components: Go >Affects Versions: 7.0.0, 8.0.0 >Reporter: Dominic Barnes >Priority: Minor > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > The packages under github.com/apache/arrow/go currently have a dependency on > github.com/stretchr/testify v1.7.0 which has a dependency on gopkg.in/yaml.v3 > that has an outstanding security vulnerability. > ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]) > While testify is only used during tests, this is not distinguished by the go > toolchain and other tools like Snyk which scan the dependency chain for > vulnerabilities. Unfortunately, due to Go's [Minimal version > selection|https://go.dev/ref/mod#minimal-version-selection], this ends up > requiring us to visit our dependencies to ensure this security vulnerability > is addressed. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16746) [C++][Python] S3 tag support on write
[ https://issues.apache.org/jira/browse/ARROW-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552769#comment-17552769 ] Steve Loughran commented on ARROW-16746: Hadoop s3a maps user attributes to the filesystem XAttr APIs, and will very soon let you also set them when you create a file. > [C++][Python] S3 tag support on write > - > > Key: ARROW-16746 > URL: https://issues.apache.org/jira/browse/ARROW-16746 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: André Kelpe >Priority: Major > > S3 allows tagging objects to better organize one's data > ([https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html]). > We use this for efficient downstream processes/inventory management. > Currently arrow/pyarrow does not allow tags to be added on write. This is > causing us to scan the bucket and re-apply the tags after a pyarrow based > process has run. > I looked through the code and think that it could potentially be done via the > metadata mechanism. > The tags need to be added to the CreateMultipartUploadRequest here: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L1156 > See also > http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_s3_1_1_model_1_1_create_multipart_upload_request.html#af791f34a65dc69bd681d6995313be2da -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-5356) [JS] Implement Duration type, integration test support for Interval and Duration types
[ https://issues.apache.org/jira/browse/ARROW-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552763#comment-17552763 ] Lukas Masuch commented on ARROW-5356: - I'm also running into the same problem ('Unrecognized type: "Duration" (18)'). Looking into the code, it seems that the Duration type is only partially implemented in javascript/typescript. For example, it fails to decode the duration type because it is missing in this switch case: [https://github.com/apache/arrow/blob/apache-arrow-9.0.0.dev/js/src/ipc/metadata/message.ts#L440] Not sure if this is intentional, a bug, or just not implemented yet? > [JS] Implement Duration type, integration test support for Interval and > Duration types > -- > > Key: ARROW-5356 > URL: https://issues.apache.org/jira/browse/ARROW-5356 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Wes McKinney >Assignee: Brian Hulette >Priority: Major > > Follow on work to ARROW-835 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16801) [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job
[ https://issues.apache.org/jira/browse/ARROW-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552760#comment-17552760 ] Raúl Cumplido commented on ARROW-16801: --- I have tried pinning an old version of minio as seen on this testing PR ([https://github.com/apache/arrow/pull/13362]) but it is not working. It seems the only version available on brew is the latest one: {code:java} % brew info minio minio: stable 20220508235031 (bottled), HEAD High Performance, Kubernetes Native Object Storage https://min.io /opt/homebrew/Cellar/minio/20210722052332 (7 files, 91.8MB) * Poured from bottle on 2021-07-26 at 11:06:45 From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/minio.rb License: AGPL-3.0-or-later ==> Dependencies Build: go ✘ ==> Options --HEAD Install HEAD version ==> Caveats To restart minio after an upgrade: brew services restart minio Or, if you don't want/need a background service you can just run: /opt/homebrew/opt/minio/bin/minio server --config-dir=/opt/homebrew/etc/minio --address=:9000 /opt/homebrew/var/minio ==> Analytics install: 1,535 (30 days), 5,097 (90 days), 19,136 (365 days) install-on-request: 1,534 (30 days), 5,097 (90 days), 19,109 (365 days) build-error: 0 (30 days) % brew search minio ==> Formulae minio ✔ minio-mc minicom minipro minica minbif mint ==> Casks min miniwol {code} [~kou] do you know if there is a way of accessing a previous formula on brew? > [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job > --- > > Key: ARROW-16801 > URL: https://issues.apache.org/jira/browse/ARROW-16801 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Anthony Louis Gotlib Ferreira >Priority: Major > > Since some days ago, all C++ PR's are failing in _arrow-s3fs-test_ for the > *AMD64 MacOS 10.15 C++* job. 
> The error message is big, but one example is below: > {code:java} > [ OK ] TestS3FS.GetFileInfoGenerator (55 ms) > 234[ RUN ] TestS3FS.CreateDir > 235/Users/runner/work/arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:798: > Failure > 236Failed > 237Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got > OK > 238* Closing connection 0 > 239* Closing connection 0 > 240[ FAILED ] TestS3FS.CreateDir (113 ms) > 241[ RUN ] TestS3FS.DeleteFile > 242* Closing connection 0 {code} > Here is a PR where that test failed: > [https://github.com/apache/arrow/runs/6816324194?check_suite_focus=true] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16801) [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job
[ https://issues.apache.org/jira/browse/ARROW-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552739#comment-17552739 ] Raúl Cumplido commented on ARROW-16801: --- I missed that we were installing minio from brew on some of the MAC jobs: [https://github.com/apache/arrow/blob/master/cpp/Brewfile#L32] We probably should install it manually from a specific version as we do on the other jobs. > [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job > --- > > Key: ARROW-16801 > URL: https://issues.apache.org/jira/browse/ARROW-16801 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Anthony Louis Gotlib Ferreira >Priority: Major > > Since some days ago, all C++ PR's are failing in _arrow-s3fs-test_ for the > *AMD64 MacOS 10.15 C++* job. > The error message is big, but one example is below: > {code:java} > [ OK ] TestS3FS.GetFileInfoGenerator (55 ms) > 234[ RUN ] TestS3FS.CreateDir > 235/Users/runner/work/arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:798: > Failure > 236Failed > 237Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got > OK > 238* Closing connection 0 > 239* Closing connection 0 > 240[ FAILED ] TestS3FS.CreateDir (113 ms) > 241[ RUN ] TestS3FS.DeleteFile > 242* Closing connection 0 {code} > Here is a PR where that test failed: > [https://github.com/apache/arrow/runs/6816324194?check_suite_focus=true] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16801) [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job
[ https://issues.apache.org/jira/browse/ARROW-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552729#comment-17552729 ] Antoine Pitrou commented on ARROW-16801: The CI failure is probably related to a new Minio version. We already pin Minio on other CI jobs to avoid such issues: https://github.com/apache/arrow/blob/master/ci/scripts/install_minio.sh#L54 > [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job > --- > > Key: ARROW-16801 > URL: https://issues.apache.org/jira/browse/ARROW-16801 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Anthony Louis Gotlib Ferreira >Priority: Major > > Since some days ago, all C++ PR's are failing in _arrow-s3fs-test_ for the > *AMD64 MacOS 10.15 C++* job. > The error message is big, but one example is below: > {code:java} > [ OK ] TestS3FS.GetFileInfoGenerator (55 ms) > 234[ RUN ] TestS3FS.CreateDir > 235/Users/runner/work/arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:798: > Failure > 236Failed > 237Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got > OK > 238* Closing connection 0 > 239* Closing connection 0 > 240[ FAILED ] TestS3FS.CreateDir (113 ms) > 241[ RUN ] TestS3FS.DeleteFile > 242* Closing connection 0 {code} > Here is a PR where that test failed: > [https://github.com/apache/arrow/runs/6816324194?check_suite_focus=true] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16801) [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job
[ https://issues.apache.org/jira/browse/ARROW-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552726#comment-17552726 ] Antoine Pitrou commented on ARROW-16801: [~anthonylouis] It is not blocking PRs, because maintainers can ignore unrelated CI failures. > [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job > --- > > Key: ARROW-16801 > URL: https://issues.apache.org/jira/browse/ARROW-16801 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Anthony Louis Gotlib Ferreira >Priority: Major > > Since some days ago, all C++ PR's are failing in _arrow-s3fs-test_ for the > *AMD64 MacOS 10.15 C++* job. > The error message is big, but one example is below: > {code:java} > [ OK ] TestS3FS.GetFileInfoGenerator (55 ms) > 234[ RUN ] TestS3FS.CreateDir > 235/Users/runner/work/arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:798: > Failure > 236Failed > 237Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got > OK > 238* Closing connection 0 > 239* Closing connection 0 > 240[ FAILED ] TestS3FS.CreateDir (113 ms) > 241[ RUN ] TestS3FS.DeleteFile > 242* Closing connection 0 {code} > Here is a PR where that test failed: > [https://github.com/apache/arrow/runs/6816324194?check_suite_focus=true] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16751) [C++] Unify target include directories
[ https://issues.apache.org/jira/browse/ARROW-16751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552718#comment-17552718 ] Jacob Wujciak-Jens commented on ARROW-16751: +1 for bumping the minimum; since 3.5 a lot of [useful functionality|https://cliutils.gitlab.io/modern-cmake/chapters/intro/newcmake.html] has been introduced. Side note: do we have CI jobs that use 3.5 to make sure that we are actually supporting it? > [C++] Unify target include directories > -- > > Key: ARROW-16751 > URL: https://issues.apache.org/jira/browse/ARROW-16751 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yibo Cai >Priority: Major > Fix For: 9.0.0 > > > Context: [https://github.com/apache/arrow/pull/13244#discussion_r889780669] > {{target_include_directories()}} in cmake 3.10 or earlier doesn't support > {{INTERFACE}} against {{IMPORTED}} target, so we have to check cmake version > like below: > {code:java} > if(CMAKE_VERSION VERSION_LESS 3.11) > set_target_properties(xsimd PROPERTIES INTERFACE_INCLUDE_DIRECTORIES > "${XSIMD_INCLUDE_DIR}") > else() > target_include_directories(xsimd INTERFACE "${XSIMD_INCLUDE_DIR}") > endif() > {code} > The above code is duplicated for some targets. There are also some targets (e.g. > ucx::ucx) that miss the check. > We can add a function > {{arrow_imported_target_interface_include_directories()}} to make it simpler. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Closed] (ARROW-16805) [R] R crashing with Apple M1 chip
[ https://issues.apache.org/jira/browse/ARROW-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gil Henriques closed ARROW-16805. - Resolution: Fixed The solution is to install R using the *Apple silicon arm64* build instead of the Intel build. > [R] R crashing with Apple M1 chip > - > > Key: ARROW-16805 > URL: https://issues.apache.org/jira/browse/ARROW-16805 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: MacBook Pro 13-inch, M1, 2020 > R version 4.1.1 (2021-08-10) > Platform: x86_64-apple-darwin17.0 (64-bit) > Running under: macOS Monterey 12.1 >Reporter: Gil Henriques >Priority: Major > > When using the \{arrow} package, R crashes as soon as a dplyr verb is used on > a parquet object. This does not happen on Windows computers, but I have > reproduced it using two separate MacBook Pros with an Apple M1 chip. The crash > happens both with RStudio and running R in the command line. > The reprex below is based on the vignette for \{arrow}, available at > [https://arrow.apache.org/docs/r/:] > > {{library(arrow)}} > {{library(dplyr)}} > {{write_parquet(starwars, sink = 'sw_parquet')}} > {{sw <- read_parquet(file = 'sw_parquet', as_data_frame = FALSE)}} > {{result <- sw %>%}} > {{ filter(homeworld == "Tatooine")}} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16801) [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job
[ https://issues.apache.org/jira/browse/ARROW-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552679#comment-17552679 ] Anthony Louis Gotlib Ferreira commented on ARROW-16801: --- [~kou] [~apitrou] do you have any idea who I can ask for help to fix this problem? It is blocking some PRs from being merged, as the CI is not green. > [CI][C++] Error in arrow-s3fs-test on AMD64 MacOS 10.15 C++ job > --- > > Key: ARROW-16801 > URL: https://issues.apache.org/jira/browse/ARROW-16801 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Anthony Louis Gotlib Ferreira >Priority: Major > > Since some days ago, all C++ PR's are failing in _arrow-s3fs-test_ for the > *AMD64 MacOS 10.15 C++* job. > The error message is big, but one example is below: > {code:java} > [ OK ] TestS3FS.GetFileInfoGenerator (55 ms) > 234[ RUN ] TestS3FS.CreateDir > 235/Users/runner/work/arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:798: > Failure > 236Failed > 237Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got > OK > 238* Closing connection 0 > 239* Closing connection 0 > 240[ FAILED ] TestS3FS.CreateDir (113 ms) > 241[ RUN ] TestS3FS.DeleteFile > 242* Closing connection 0 {code} > Here is a PR where that test failed: > [https://github.com/apache/arrow/runs/6816324194?check_suite_focus=true] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16806) [CI][Python] Verify source script fails on Ubuntu 18.04 due to old setuptools version
[ https://issues.apache.org/jira/browse/ARROW-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16806: --- Labels: pull-request-available (was: ) > [CI][Python] Verify source script fails on Ubuntu 18.04 due to old setuptools > version > - > > Key: ARROW-16806 > URL: https://issues.apache.org/jira/browse/ARROW-16806 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Raúl Cumplido >Assignee: Raúl Cumplido >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Currently the verify-rc-source-python-linux-ubuntu-18.04-amd64 job is failing > (https://github.com/ursacomputing/crossbow/runs/6814290999?check_suite_focus=true) > due to: > {code:java} > File "setup.py", line 37, in <module> > from setuptools import setup, Extension, Distribution, > find_namespace_packages > ImportError: cannot import name 'find_namespace_packages' from 'setuptools' > (/tmp/arrow-HEAD.kvwV0/venv-source/lib/python3.8/site-packages/setuptools/__init__.py) > {code} > This change was introduced on this PR > [https://github.com/apache/arrow/pull/13309] to fix > https://issues.apache.org/jira/browse/ARROW-16726. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16806) [CI][Python] Verify source script fails on Ubuntu 18.04 due to old setuptools version
Raúl Cumplido created ARROW-16806: - Summary: [CI][Python] Verify source script fails on Ubuntu 18.04 due to old setuptools version Key: ARROW-16806 URL: https://issues.apache.org/jira/browse/ARROW-16806 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Python Reporter: Raúl Cumplido Assignee: Raúl Cumplido Fix For: 9.0.0 Currently the verify-rc-source-python-linux-ubuntu-18.04-amd64 job is failing (https://github.com/ursacomputing/crossbow/runs/6814290999?check_suite_focus=true) due to: {code:java} File "setup.py", line 37, in <module> from setuptools import setup, Extension, Distribution, find_namespace_packages ImportError: cannot import name 'find_namespace_packages' from 'setuptools' (/tmp/arrow-HEAD.kvwV0/venv-source/lib/python3.8/site-packages/setuptools/__init__.py) {code} This change was introduced on this PR [https://github.com/apache/arrow/pull/13309] to fix https://issues.apache.org/jira/browse/ARROW-16726. -- This message was sent by Atlassian Jira (v8.20.7#820007)
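The failing import in the traceback above can be guarded so setup.py works on older setuptools releases. A minimal sketch, assuming the `PEP420PackageFinder` fallback (which is how recent setuptools versions implement `find_namespace_packages`; its availability on a given old release is an assumption, not something confirmed by this ticket):

```python
# Hedged sketch: tolerate setuptools releases that predate
# find_namespace_packages (added around setuptools 40.1.0).
try:
    from setuptools import find_namespace_packages
except ImportError:
    # Assumption: PEP420PackageFinder backs find_namespace_packages
    # in newer setuptools and is importable on the old release too.
    from setuptools import PEP420PackageFinder
    find_namespace_packages = PEP420PackageFinder.find
```

With a guard like this, `find_namespace_packages(where="python")` can be called the same way regardless of which branch was taken.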
[jira] [Resolved] (ARROW-16799) [C++] Create a signal-safe self-pipe abstraction
[ https://issues.apache.org/jira/browse/ARROW-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai resolved ARROW-16799. -- Resolution: Fixed Issue resolved by pull request 13354 [https://github.com/apache/arrow/pull/13354] > [C++] Create a signal-safe self-pipe abstraction > > > Key: ARROW-16799 > URL: https://issues.apache.org/jira/browse/ARROW-16799 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > A signal-safe self-pipe is already used in the Flight server in order to > shutdown the server on an incoming SIGINT. > We should refactor this to expose a reusable abstraction. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16805) [R] R crashing with Apple M1 chip
Gil Henriques created ARROW-16805: - Summary: [R] R crashing with Apple M1 chip Key: ARROW-16805 URL: https://issues.apache.org/jira/browse/ARROW-16805 Project: Apache Arrow Issue Type: Bug Components: R Environment: MacBook Pro 13-inch, M1, 2020 R version 4.1.1 (2021-08-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.1 Reporter: Gil Henriques When using the \{arrow} package, R crashes as soon as a dplyr verb is used on a parquet object. This does not happen on Windows computers, but I have reproduced it using two separate MacBook Pros with an Apple M1 chip. Crash happens both with RStudio and running R in the command line. The reprex below is based on the vignette for \{arrow}, available at [https://arrow.apache.org/docs/r/:] {{library(arrow)}} {{library(dplyr)}} {{write_parquet(starwars, sink = 'sw_parquet')}} {{sw <- read_parquet(file = 'sw_parquet', as_data_frame = FALSE)}} {{result <- sw %>%}} {{ filter(homeworld == "Tatooine")}} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16543) [JS] Timestamp types are all the same
[ https://issues.apache.org/jira/browse/ARROW-16543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552616#comment-17552616 ] Teodor Kostov commented on ARROW-16543: --- Here is a full test: {code:javascript} const dataType = new arrow.Struct([ new arrow.Field('time', new arrow.TimestampSecond()), new arrow.Field('value', new arrow.Float64()), ]) const builder = arrow.makeBuilder({ type: dataType, nullValues: [null, undefined] }) const date = new Date() const timestampSeconds = Math.floor(date.getTime() / 1000) const timestamp = timestampSeconds * 1000 builder.append({ time: date, value: 1.2 }) builder.append({ time: date, value: 3.3 }) builder.finish() const vector = builder.toVector() const schema = new arrow.Schema(dataType.children) const recordBatch = new arrow.RecordBatch(schema, vector.data[0]) const table = new arrow.Table(recordBatch) console.log(timestamp) console.log(timestampSeconds) console.log(table.get(0).time) console.log(table.get(0).time === timestamp) // should be false console.log(table.get(0).time === timestampSeconds) // should be true {code} > [JS] Timestamp types are all the same > - > > Key: ARROW-16543 > URL: https://issues.apache.org/jira/browse/ARROW-16543 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Reporter: Teodor Kostov >Priority: Major > > Current timestamp types are all the same. They have the same representation. > And also the same precision. > For example, {{TimestampSecond}} and {{TimestampMillisecond}} return the > values as {{165211818}}. Instead, I would expect the {{TimestampSecond}} > to drop the 3 zeros when returning a value, e.g. {{1652118180}}. Also, the > representation underneath is still an {{int32}} array. Even though for > {{TimestampSecond}} every second value is {{0}}, the array still has double > the amount of integers. > I also got an error when trying to read a {{Date}} as {{TimestampNanosecond}} > - {{TypeError: can't convert 165211818 to BigInt}}. 
-- This message was sent by Atlassian Jira (v8.20.7#820007)
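The seconds-versus-milliseconds arithmetic that the JavaScript test above hinges on can be reproduced outside Arrow. A plain-Python sketch (stdlib only, not Arrow JS; the fixed date is illustrative):

```python
import datetime

# A fixed instant so the example is deterministic
date = datetime.datetime(2022, 6, 10, 12, 0, 0, tzinfo=datetime.timezone.utc)

timestamp_seconds = int(date.timestamp())     # second-resolution epoch value
timestamp_millis = timestamp_seconds * 1000   # millisecond-resolution value

# A second-resolution timestamp column should yield the former; the
# reported bug is that the millisecond-scaled value comes back instead.
assert timestamp_millis == timestamp_seconds * 1000
assert len(str(timestamp_millis)) == len(str(timestamp_seconds)) + 3
```

This is the same relationship as `Math.floor(date.getTime() / 1000)` in the JavaScript test: `getTime()` is milliseconds, so a `TimestampSecond` value should be smaller by a factor of 1000.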
[jira] [Updated] (ARROW-16804) [CI][Conan] Merge upstream changes
[ https://issues.apache.org/jira/browse/ARROW-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16804: --- Labels: pull-request-available (was: ) > [CI][Conan] Merge upstream changes > -- > > Key: ARROW-16804 > URL: https://issues.apache.org/jira/browse/ARROW-16804 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16804) [CI][Conan] Merge upstream changes
Kouhei Sutou created ARROW-16804: Summary: [CI][Conan] Merge upstream changes Key: ARROW-16804 URL: https://issues.apache.org/jira/browse/ARROW-16804 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument
[ https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552569#comment-17552569 ] Yaron Gvili commented on ARROW-16796: - If coding safety is a major concern here (IMHO it is), I'd suggest that in the longer term Arrow code should distinguish between simplification of expressions with and without functions/execution, where only the former requires an ExecContext and only the latter will fail if a function exists in the expression. Perhaps the simplest, though likely not ideal, code change for this is defaulting ExecContext to an implementation that fails. The purpose of the PR is just a short-term fix. Follow-up issues can be created for what remains. > [C++] Fix bad defaulting of ExecContext argument > > > Key: ARROW-16796 > URL: https://issues.apache.org/jira/browse/ARROW-16796 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Yaron Gvili >Assignee: Yaron Gvili >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > In several places in Arrow code, invocations of Expression::Bind() default > the ExecContext argument. This leads to the default function registry being > used in expression manipulations, and this becomes a problem when the user > wishes to use a non-default function registry, e.g., when passing one to the > ExecContext of an ExecPlan, which is how I discovered this issue. The > problematic places I found for such Expression::Bind() invocation are: > * cpp/src/arrow/dataset/file_parquet.cc > * cpp/src/arrow/dataset/scanner.cc > * cpp/src/arrow/compute/exec/project_node.cc > * cpp/src/arrow/compute/exec/hash_join_node.cc > * cpp/src/arrow/compute/exec/filter_node.cc > There are also other places in test and benchmark code (grep for 'Bind()'). 
> Another case of bad defaulting of an ExecContext argument is in > Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh > ExecContext is created, instead of being received from the caller, and passed > to BindNonRecursive. > I'd argue that an ExecContext variable should not be allowed to default, > except perhaps in the highest-level/user-facing APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007)
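The hazard described in this ticket can be modeled in a few lines. A toy sketch of the pitfall (the names `bind` and `DEFAULT_REGISTRY` are illustrative stand-ins, not Arrow APIs):

```python
# Toy model of the hazard: a global default registry, mirroring Arrow's
# default function registry, and a bind() whose registry argument defaults.
DEFAULT_REGISTRY = {"add": lambda a, b: a + b}

def bind(func_name, registry=None):
    # Defaulting here is the pitfall: a caller who configured a custom
    # registry elsewhere can silently fall back to the global one.
    registry = DEFAULT_REGISTRY if registry is None else registry
    return registry[func_name]

# A user-supplied registry with a different implementation of "add"
custom_registry = {"add": lambda a, b: (a + b) * 10}

global_result = bind("add")(1, 2)                    # default registry used
custom_result = bind("add", custom_registry)(1, 2)   # custom registry honored
```

Dropping the default (making `registry` a required parameter) turns the silent fallback into a compile-time or call-site error, which is the direction both comments above argue for.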
[jira] [Commented] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument
[ https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552568#comment-17552568 ] Yaron Gvili commented on ARROW-16796: - Copying [Weston Pace's post|https://github.com/apache/arrow/pull/13355#issuecomment-1151679039]: Good catch. I wonder if we should remove the default argument to bind entirely (it would look something like [westonpace@c9ae1dd|https://github.com/westonpace/arrow/commit/c9ae1dd6a0857af69e48a95ec76480f4c466791e]). Looks like there are only a few other non-test spots we call bind. * In the parquet reader we convert statistics into expressions and bind them to the schema. These expressions will only use min/max and it's only really for simplification and not execution, so we're probably ok. * The scanner has a number of methods that create exec plans (this is the "lightweight producer" half of the scanner). We could arguably add an ExecContext to scan options but I think it would be better to start phasing out this half of the scanner in favor of direct use of exec plans instead. > [C++] Fix bad defaulting of ExecContext argument > > > Key: ARROW-16796 > URL: https://issues.apache.org/jira/browse/ARROW-16796 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Yaron Gvili >Assignee: Yaron Gvili >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > In several places in Arrow code, invocations of Expression::Bind() default > the ExecContext argument. This leads to the default function registry being > used in expression manipulations, and this becomes a problem when the user > wishes to use a non-default function registry, e.g., when passing one to the > ExecContext of an ExecPlan, which is how I discovered this issue. 
The > problematic places I found for such Expression::Bind() invocation are: > * cpp/src/arrow/dataset/file_parquet.cc > * cpp/src/arrow/dataset/scanner.cc > * cpp/src/arrow/compute/exec/project_node.cc > * cpp/src/arrow/compute/exec/hash_join_node.cc > * cpp/src/arrow/compute/exec/filter_node.cc > There are also other places in test and benchmark code (grep for 'Bind()'). > Another case of bad defaulting of an ExecContext argument is in > Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh > ExecContext is created, instead of being received from the caller, and passed > to BindNonRecursive. > I'd argue that an ExecContext variable should not be allowed to default, > except perhaps in the highest-level/user-facing APIs. -- This message was sent by Atlassian Jira (v8.20.7#820007)