[jira] [Updated] (ARROW-18121) [Release][CI] Use Ubuntu 22.04 for verifying binaries
[ https://issues.apache.org/jira/browse/ARROW-18121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18121: --- Labels: pull-request-available (was: ) > [Release][CI] Use Ubuntu 22.04 for verifying binaries > - > > Key: ARROW-18121 > URL: https://issues.apache.org/jira/browse/ARROW-18121 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > APT/Yum verifications use Docker. If we use old libseccomp on host, some > operations may cause errors: > e.g.: > https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437 > {noformat} > + valac --pkg arrow-glib --pkg posix build.vala > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18121) [Release][CI] Use Ubuntu 22.04 for verifying binaries
[ https://issues.apache.org/jira/browse/ARROW-18121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-18121: - Description: APT/Yum verifications use Docker. If we use old libseccomp on host, some operations may cause errors: e.g.: https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437 {noformat} + valac --pkg arrow-glib --pkg posix build.vala error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) {noformat} was: APT/Yum verifications use Docker. If we use some old version packages (I can't remember that what packages have a problem...) 
on host, some operations may cause errors: e.g.: https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437 {noformat} + valac --pkg arrow-glib --pkg posix build.vala error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) {noformat} > [Release][CI] Use Ubuntu 22.04 for verifying binaries > - > > Key: ARROW-18121 > URL: https://issues.apache.org/jira/browse/ARROW-18121 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > > APT/Yum verifications use Docker. If we use old libseccomp on host, some > operations may cause errors: > e.g.: > https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437 > {noformat} > + valac --pkg arrow-glib --pkg posix build.vala > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18121) [Release][CI] Use Ubuntu 22.04 for verifying binaries
Kouhei Sutou created ARROW-18121: Summary: [Release][CI] Use Ubuntu 22.04 for verifying binaries Key: ARROW-18121 URL: https://issues.apache.org/jira/browse/ARROW-18121 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Kouhei Sutou Assignee: Kouhei Sutou APT/Yum verifications use Docker. If we use some old versions of packages (I can't remember which packages have a problem...) on the host, some operations may cause errors: e.g.: https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437 {noformat} + valac --pkg arrow-glib --pkg posix build.vala error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18121) [Release][CI] Use Ubuntu 22.04 for verifying binaries
[ https://issues.apache.org/jira/browse/ARROW-18121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-18121: - Description: APT/Yum verifications use Docker. If we use some old version packages (I can't remember that what packages have a problem...) on host, some operations may cause errors: e.g.: https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437 {noformat} + valac --pkg arrow-glib --pkg posix build.vala error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) {noformat} was: APT/Yum verifications use Docker. If we use some old version packages (I can't remember that what packages have a problem...) 
on host, some operations may cause errors: e.g.: https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437 {format} + valac --pkg arrow-glib --pkg posix build.vala error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) error: Failed to close file descriptor for child process (Operation not permitted) {format} > [Release][CI] Use Ubuntu 22.04 for verifying binaries > - > > Key: ARROW-18121 > URL: https://issues.apache.org/jira/browse/ARROW-18121 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > > APT/Yum verifications use Docker. If we use some old version packages (I > can't remember that what packages have a problem...) on host, some operations > may cause errors: > e.g.: > https://github.com/ursacomputing/crossbow/actions/runs/3294870946/jobs/5432835953#step:7:5437 > {noformat} > + valac --pkg arrow-glib --pkg posix build.vala > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > error: Failed to close file descriptor for child process (Operation not > permitted) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-18093) [CI][Conda][Windows] Failed with missing ORC
[ https://issues.apache.org/jira/browse/ARROW-18093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-18093. -- Fix Version/s: 11.0.0 Resolution: Fixed Issue resolved by pull request 14454 [https://github.com/apache/arrow/pull/14454] > [CI][Conda][Windows] Failed with missing ORC > > > Key: ARROW-18093 > URL: https://issues.apache.org/jira/browse/ARROW-18093 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Packaging >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=37759=logs=4c86bc1b-1091-5192-4404-c74dfaad23e7=41795ef0-6501-5db4-3ad4-33c0cf085626=497 > {noformat} > CMake Error at cmake_modules/FindORC.cmake:56 (message): > ORC library was required in toolchain and unable to locate > Call Stack (most recent call first): > cmake_modules/ThirdpartyToolchain.cmake:280 (find_package) > cmake_modules/ThirdpartyToolchain.cmake:4362 (resolve_dependency) > CMakeLists.txt:496 (include) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18120) [Release][Dev] Run binaries/wheels verifications in 05-binary-upload.sh
[ https://issues.apache.org/jira/browse/ARROW-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18120: --- Labels: pull-request-available (was: ) > [Release][Dev] Run binaries/wheels verifications in 05-binary-upload.sh > --- > > Key: ARROW-18120 > URL: https://issues.apache.org/jira/browse/ARROW-18120 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > We have a script (02-source.sh) that runs source verifications. > But we don't have a script that runs binaries/wheels verifications yet. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18120) [Release][Dev] Run binaries/wheels verifications in 05-binary-upload.sh
Kouhei Sutou created ARROW-18120: Summary: [Release][Dev] Run binaries/wheels verifications in 05-binary-upload.sh Key: ARROW-18120 URL: https://issues.apache.org/jira/browse/ARROW-18120 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Assignee: Kouhei Sutou We have a script (02-source.sh) that runs source verifications. But we don't have a script that runs binaries/wheels verifications yet. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18091) [Ruby] Arrow::Table#join returns duplicated key columns
[ https://issues.apache.org/jira/browse/ARROW-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621484#comment-17621484 ] Hirokazu SUZUKI commented on ARROW-18091: - I mean dplyr's join( ..., keep=FALSE) behavior. Sorry for the poor explanation. > [Ruby] Arrow::Table#join returns duplicated key columns > --- > > Key: ARROW-18091 > URL: https://issues.apache.org/jira/browse/ARROW-18091 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby >Reporter: Hirokazu SUZUKI >Priority: Major > > `Arrow::Table#join` returns columns with duplicate keys. Duplicate column > names are acceptable in Arrow, but it is preferable to use one. > Also with `type: :full_outer`, column data should be merged. > table1 > => > # > KEY X > 0 A 1 > 1 B 2 > 2 C 3 > table2 > => > # > KEY X > 0 A 4 > 1 B 5 > 2 D 6 > > Should omit `:KEY` in right > table1.join(table2, :KEY) > => > # > KEY X KEY X > 0 A 1 A 4 > 1 B 2 B 5 > > Should merge `:KEY`s > table1.join(table2, :KEY, type: :full_outer) > => > # > KEY X KEY X > 0 A 1 A 4 > 1 B 2 B 5 > 2 C 3 (null) (null) > 3 (null) (null) D 6 > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18119) [C++] Utility method to ensure an array object meets an alignment requirement
Weston Pace created ARROW-18119: --- Summary: [C++] Utility method to ensure an array object meets an alignment requirement Key: ARROW-18119 URL: https://issues.apache.org/jira/browse/ARROW-18119 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Weston Pace This would look something like: EnsureAligned(Buffer|Array|ChunkedArray|RecordBatch|Table, int minimum_alignment, MemoryPool* memory_pool); It would fail if the MemoryPool's alignment < minimum_alignment. It would iterate through each buffer of the object; if a buffer is not aligned properly, it would reallocate and copy that buffer (using memory_pool). It would return a new object where every buffer is guaranteed to meet the alignment requirement. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-6575) [JS] decimal toString does not support negative values
[ https://issues.apache.org/jira/browse/ARROW-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621431#comment-17621431 ] Jonathan Swenson edited comment on ARROW-6575 at 10/21/22 1:14 AM: --- Is this the correct method to use? It appears as though this still does not support negative numbers, and perhaps also does not support non-zero scale. If so, this is still present in 9.0.0. >From the implementation in the c++ source, I believe there are two missing >pieces. + handling of negative values. (determine if negative and negate before rendering) – [determine if negative|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.h#L125] | [negation |https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.cc#L377-L385] + scaling values appropriately (insert the decimal place / prepend leading zeros as necessary). [implementation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L328-L386] The conversion to integer string is implemented in c++ [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L685-L698] which includes checking if negative before rendering. I see values like: + 101 (with scale 10, precision 13) renders as `10100` – where I expect this as `101.00` + -101 (with scale 10, precision 13) renders as `3402823669209385000` - I expect this as `-101.00` I'm not really a Javascript expert, but a similar approach to the negation check and flipping appears to work, but I'm fairly confident I'm missing some edge cases. General algorithm: + check to see if the high bits are negative + if so, number is negative (prepend with "-") and add 1 to each "chunk" and handle overflows appropriately. + render the string using current implementation. + place decimal place (or prepend with 0s) based on the scale. was (Author: jswenson): Is this the correct method to use? 
It appears as though this still does not support negative numbers, and perhaps also does not support non-zero scale. >From the implementation in the c++ source, I believe there are two missing >pieces. + handling of negative values. (determine if negative and negate before rendering) – [determine if negative|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.h#L125] [negation | https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.cc#L377-L385] + scaling values appropriately (insert the decimal place / prepend leading zeros as necessary). [implementation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L328-L386] The conversion to integer string is implemented in c++ [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L685-L698] which includes checking if negative before rendering. I see values like: + 101 (with scale 10, precision 13) renders as `10100` – where I expect this as `101.00` + -101 (with scale 10, precision 13) renders as `3402823669209385000` - I expect this as `-101.00` I'm not really a Javascript expert, but a similar approach to the negation check and flipping appears to work, but I'm fairly confident I'm missing some edge cases. General algorithm: + check to see if the high bits are negative + if so, number is negative (prepend with "-") and add 1 to each "chunk" and handle overflows appropriately. + render the string using current implementation. + place decimal place (or prepend with 0s) based on the scale. 
> [JS] decimal toString does not support negative values > -- > > Key: ARROW-6575 > URL: https://issues.apache.org/jira/browse/ARROW-6575 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.14.1 >Reporter: Andong Zhan >Priority: Critical > > The main description is here: [https://github.com/apache/arrow/issues/5397] > Also, I have a simple test case (slightly changed generate-test-data.js and > generated-data-validators): > {code:java} > export const decimal = (length = 2, nullCount = length * 0.2 | 0, scale = 0, > precision = 38) => vectorGenerator.visit(new Decimal(scale, precision), > length, nullCount); > function fillDecimal(length: number) { > // const BPE = Uint32Array.BYTES_PER_ELEMENT; // 4 > const array = new Uint32Array(length); > // const max = (2 ** (8 * BPE)) - 1; > // for (let i = -1; ++i < length; array[i] = rand() * max * (rand() > 0.5 > ? -1 : 1)); > array[0] = 0; > array[1] = 1286889712; > array[2] = 2218195178; > array[3] = 4282345521; > array[4] = 0; > array[5] = 16004768; > array[6] = 3587851993; > array[7]
[jira] [Comment Edited] (ARROW-6575) [JS] decimal toString does not support negative values
[ https://issues.apache.org/jira/browse/ARROW-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621431#comment-17621431 ] Jonathan Swenson edited comment on ARROW-6575 at 10/21/22 1:11 AM: --- Is this the correct method to use? It appears as though this still does not support negative numbers, and perhaps also does not support non-zero scale. >From the implementation in the c++ source, I believe there are two missing >pieces. + handling of negative values. (determine if negative and negate before rendering) – [determine if negative|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.h#L125] [negation | https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.cc#L377-L385] + scaling values appropriately (insert the decimal place / prepend leading zeros as necessary). [implementation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L328-L386] The conversion to integer string is implemented in c++ [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L685-L698] which includes checking if negative before rendering. I see values like: + 101 (with scale 10, precision 13) renders as `10100` – where I expect this as `101.00` + -101 (with scale 10, precision 13) renders as `3402823669209385000` - I expect this as `-101.00` I'm not really a Javascript expert, but a similar approach to the negation check and flipping appears to work, but I'm fairly confident I'm missing some edge cases. General algorithm: + check to see if the high bits are negative + if so, number is negative (prepend with "-") and add 1 to each "chunk" and handle overflows appropriately. + render the string using current implementation. + place decimal place (or prepend with 0s) based on the scale. was (Author: jswenson): Is this the correct method to use? It appears as though this still does not support negative numbers, and perhaps also does not support non-zero scale. 
>From the implementation in the c++ source, I believe there are two missing >pieces. + handling of negative values. (determine if negative and negate before rendering) – [[determine if negative|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.h#L125]][[negation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.cc#L377-L385]] + scaling values appropriately (insert the decimal place / prepend leading zeros as necessary). [[implementation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L328-L386]] The conversion to integer string is implemented in c++ [[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L685-L698]] which includes checking if negative before rendering. I see values like: + 101 (with scale 10, precision 13) renders as `10100` – where I expect this as `101.00` + -101 (with scale 10, precision 13) renders as `3402823669209385000`- I expect this as `-101.00` I'm not really a Javascript expert, but a similar approach to the negation check and flipping appears to work, but I'm fairly confident I'm missing some edge cases. General algorithm: + check to see if the high bits are negative + if so, number is negative (prepend with "-") and add 1 to each "chunk" and handle overflows appropriately. + render the string using current implementation. + place decimal place (or prepend with 0s) based on the scale. 
> [JS] decimal toString does not support negative values > -- > > Key: ARROW-6575 > URL: https://issues.apache.org/jira/browse/ARROW-6575 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.14.1 >Reporter: Andong Zhan >Priority: Critical > > The main description is here: [https://github.com/apache/arrow/issues/5397] > Also, I have a simple test case (slightly changed generate-test-data.js and > generated-data-validators): > {code:java} > export const decimal = (length = 2, nullCount = length * 0.2 | 0, scale = 0, > precision = 38) => vectorGenerator.visit(new Decimal(scale, precision), > length, nullCount); > function fillDecimal(length: number) { > // const BPE = Uint32Array.BYTES_PER_ELEMENT; // 4 > const array = new Uint32Array(length); > // const max = (2 ** (8 * BPE)) - 1; > // for (let i = -1; ++i < length; array[i] = rand() * max * (rand() > 0.5 > ? -1 : 1)); > array[0] = 0; > array[1] = 1286889712; > array[2] = 2218195178; > array[3] = 4282345521; > array[4] = 0; > array[5] = 16004768; > array[6] = 3587851993; > array[7] = 126217744; > return array; >
[jira] [Commented] (ARROW-6575) [JS] decimal toString does not support negative values
[ https://issues.apache.org/jira/browse/ARROW-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621431#comment-17621431 ] Jonathan Swenson commented on ARROW-6575: - Is this the correct method to use? It appears as though this still does not support negative numbers, and perhaps also does not support non-zero scale. >From the implementation in the c++ source, I believe there are two missing >pieces. + handling of negative values. (determine if negative and negate before rendering) – [[determine if negative|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.h#L125]][[negation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/basic_decimal.cc#L377-L385]] + scaling values appropriately (insert the decimal place / prepend leading zeros as necessary). [[implementation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L328-L386]] The conversion to integer string is implemented in c++ [[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal.cc#L685-L698]] which includes checking if negative before rendering. I see values like: + 101 (with scale 10, precision 13) renders as `10100` – where I expect this as `101.00` + -101 (with scale 10, precision 13) renders as `3402823669209385000`- I expect this as `-101.00` I'm not really a Javascript expert, but a similar approach to the negation check and flipping appears to work, but I'm fairly confident I'm missing some edge cases. General algorithm: + check to see if the high bits are negative + if so, number is negative (prepend with "-") and add 1 to each "chunk" and handle overflows appropriately. + render the string using current implementation. + place decimal place (or prepend with 0s) based on the scale. 
> [JS] decimal toString does not support negative values > -- > > Key: ARROW-6575 > URL: https://issues.apache.org/jira/browse/ARROW-6575 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.14.1 >Reporter: Andong Zhan >Priority: Critical > > The main description is here: [https://github.com/apache/arrow/issues/5397] > Also, I have a simple test case (slightly changed generate-test-data.js and > generated-data-validators): > {code:java} > export const decimal = (length = 2, nullCount = length * 0.2 | 0, scale = 0, > precision = 38) => vectorGenerator.visit(new Decimal(scale, precision), > length, nullCount); > function fillDecimal(length: number) { > // const BPE = Uint32Array.BYTES_PER_ELEMENT; // 4 > const array = new Uint32Array(length); > // const max = (2 ** (8 * BPE)) - 1; > // for (let i = -1; ++i < length; array[i] = rand() * max * (rand() > 0.5 > ? -1 : 1)); > array[0] = 0; > array[1] = 1286889712; > array[2] = 2218195178; > array[3] = 4282345521; > array[4] = 0; > array[5] = 16004768; > array[6] = 3587851993; > array[7] = 126217744; > return array; > } > {code} > and the expected value should be > {code:java} > expect(vector.get(0).toString()).toBe('-1'); > expect(vector.get(1).toString()).toBe('1'); > {code} > However, the actual first value is 339282366920938463463374607431768211456 > which is wrong! The second value is correct by the way. > I believe the bug is in the function called > function decimalToString>(a: T) because it cannot > return a negative value at all. > [arrow/js/src/util/bn.ts|https://github.com/apache/arrow/blob/d54425de19b7dbb2764a40355d76d1c785cf64ec/js/src/util/bn.ts#L99] > Line 99 -- This message was sent by Atlassian Jira (v8.20.10#820010)
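The general algorithm described in the comment above (check the high bits, negate via two's complement, render, then place the decimal point by scale) can be sketched in C++. The function name and chunk layout are assumptions for illustration: a 128-bit decimal stored as four little-endian uint32 chunks, using the GCC/Clang unsigned __int128 extension.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Sketch of the steps described above: detect the sign from the high bit,
// negate via two's complement, render the magnitude in base 10, then insert
// the decimal point according to the scale. Names are illustrative.
std::string DecimalChunksToString(const std::array<uint32_t, 4>& chunks, int scale) {
  unsigned __int128 value = 0;
  for (int i = 3; i >= 0; --i) value = (value << 32) | chunks[i];
  const bool negative = (chunks[3] & 0x80000000u) != 0;  // high bit set => negative
  if (negative) value = ~value + 1;  // two's-complement negation
  std::string digits = value == 0 ? "0" : "";
  while (value > 0) {
    digits.insert(digits.begin(), static_cast<char>('0' + static_cast<int>(value % 10)));
    value /= 10;
  }
  if (scale > 0) {
    // Prepend leading zeros so at least one digit precedes the decimal point.
    while (static_cast<int>(digits.size()) <= scale) digits.insert(digits.begin(), '0');
    digits.insert(digits.end() - scale, '.');
  }
  return (negative ? "-" : "") + digits;
}
```

The same negate-then-render structure would carry over to the JavaScript implementation, with the carry handling done per 32-bit chunk since JS lacks a native 128-bit integer.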
[jira] [Commented] (ARROW-18113) Implement a read range process without caching
[ https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621428#comment-17621428 ] David Li commented on ARROW-18113: -- Ah ok I was forgetting that we could theoretically split up reads. Thanks! > Implement a read range process without caching > -- > > Key: ARROW-18113 > URL: https://issues.apache.org/jira/browse/ARROW-18113 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > > The current > [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100] > is mixing caching with coalescing and making difficult to implement readers > capable to really perform concurrent reads on coalesced data (see this > [github > comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for > additional context); for instance, right now the prebuffering feature of > those readers cannot handle concurrent invocations. > The goal for this ticket is to implement a similar component to > ReadRangeCache for performing non-cache reads (doing only the coalescing part > instead). So, once we have that new capability, we can port the parquet and > IPC readers to this new component and keep improving the reading process > (that would be part of other set of follow-up tickets). Similar ideas were > mentioned here https://issues.apache.org/jira/browse/ARROW-17599 > Maybe a good place to implement this new capability is inside the file system > abstraction (as part of a dedicated method to read coalesced data) and where > the abstract file system can provide a default implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-18113) Implement a read range process without caching
[ https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621427#comment-17621427 ] Weston Pace edited comment on ARROW-18113 at 10/21/22 12:43 AM: > Just to be clear: to the filesystem, or on the reader itself? Oops, I mean on {{RandomAccessFile}}. > Also, I'm not clear on: "Multiple returned futures may correspond to a single > read. Or, a single returned future may be a combined result of several > individual reads." Isn't this saying the same thing twice? I might call {noformat} file->ReadMany({0, 3}, {3, 8}, {1024, 16Mi}) {noformat}. The filesystem could then implement this as: {noformat} std::vector futures; # The first two futures correspond to the same read Future coalesced_read = ReadAsync(0, 8); futures.push_back(coalesced_read.Then(buf => buf.Split(0, 3))); futures.push_back(coalesced_read.Then(buf => buf.Split(3, 5))); # The third future corresponds to two reads Future part_one = ReadAsync(1024, 8Mi); Future part_two = ReadAsync(1024+8Mi, 8Mi-1024); futures.push_back(AllComplete({part_one, part_two}).Then(bufs => Concatenate(bufs)); {noformat} was (Author: westonpace): > Just to be clear: to the filesystem, or on the reader itself? Oops, I mean on {{RandomAccessFile}}. > Also, I'm not clear on: "Multiple returned futures may correspond to a single > read. Or, a single returned future may be a combined result of several > individual reads." Isn't this saying the same thing twice? I might call {{file->ReadMany({0, 3}, {3, 8}, {1024, 16Mi})}}. 
The filesystem could then implement this as: {noformat} std::vector futures; # The first two futures correspond to the same read Future coalesced_read = ReadAsync(0, 8); futures.push_back(coalesced_read.Then(buf => buf.Split(0, 3))); futures.push_back(coalesced_read.Then(buf => buf.Split(3, 5))); # The third future corresponds to two reads Future part_one = ReadAsync(1024, 8Mi); Future part_two = ReadAsync(1024+8Mi, 8Mi-1024); futures.push_back(AllComplete({part_one, part_two}).Then(bufs => Concatenate(bufs)); {noformat} > Implement a read range process without caching > -- > > Key: ARROW-18113 > URL: https://issues.apache.org/jira/browse/ARROW-18113 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > > The current > [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100] > is mixing caching with coalescing and making difficult to implement readers > capable to really perform concurrent reads on coalesced data (see this > [github > comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for > additional context); for instance, right now the prebuffering feature of > those readers cannot handle concurrent invocations. > The goal for this ticket is to implement a similar component to > ReadRangeCache for performing non-cache reads (doing only the coalescing part > instead). So, once we have that new capability, we can port the parquet and > IPC readers to this new component and keep improving the reading process > (that would be part of other set of follow-up tickets). 
Similar ideas were > mentioned here https://issues.apache.org/jira/browse/ARROW-17599 > Maybe a good place to implement this new capability is inside the file system > abstraction (as part of a dedicated method to read coalesced data) and where > the abstract file system can provide a default implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18113) Implement a read range process without caching
[ https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621427#comment-17621427 ] Weston Pace commented on ARROW-18113: - > Just to be clear: to the filesystem, or on the reader itself? Oops, I mean on {{RandomAccessFile}}. > Also, I'm not clear on: "Multiple returned futures may correspond to a single > read. Or, a single returned future may be a combined result of several > individual reads." Isn't this saying the same thing twice? I might call {{file->ReadMany({0, 3}, {3, 8}, {1024, 16Mi})}}. The filesystem could then implement this as:
{noformat}
std::vector<Future<std::shared_ptr<Buffer>>> futures;
// The first two futures correspond to the same read
Future<std::shared_ptr<Buffer>> coalesced_read = ReadAsync(0, 8);
futures.push_back(coalesced_read.Then(buf => buf.Split(0, 3)));
futures.push_back(coalesced_read.Then(buf => buf.Split(3, 5)));
// The third future corresponds to two reads
Future<std::shared_ptr<Buffer>> part_one = ReadAsync(1024, 8Mi);
Future<std::shared_ptr<Buffer>> part_two = ReadAsync(1024 + 8Mi, 8Mi - 1024);
futures.push_back(AllComplete({part_one, part_two}).Then(bufs => Concatenate(bufs)));
{noformat}
> Implement a read range process without caching > -- > > Key: ARROW-18113 > URL: https://issues.apache.org/jira/browse/ARROW-18113 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > > The current > [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100] > is mixing caching with coalescing and making difficult to implement readers > capable to really perform concurrent reads on coalesced data (see this > [github > comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for > additional context); for instance, right now the prebuffering feature of > those readers cannot handle concurrent invocations. 
> The goal for this ticket is to implement a similar component to > ReadRangeCache for performing non-cache reads (doing only the coalescing part > instead). So, once we have that new capability, we can port the parquet and > IPC readers to this new component and keep improving the reading process > (that would be part of other set of follow-up tickets). Similar ideas were > mentioned here https://issues.apache.org/jira/browse/ARROW-17599 > Maybe a good place to implement this new capability is inside the file system > abstraction (as part of a dedicated method to read coalesced data) and where > the abstract file system can provide a default implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18118) [Release][Dev] 02-source.sh/03-binary-submit.sh didn't work for 10.0.0
[ https://issues.apache.org/jira/browse/ARROW-18118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18118: --- Labels: pull-request-available (was: ) > [Release][Dev] 02-source.sh/03-binary-submit.sh didn't work for 10.0.0 > -- > > Key: ARROW-18118 > URL: https://issues.apache.org/jira/browse/ARROW-18118 > Project: Apache Arrow > Issue Type: Bug > Components: Developer Tools >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > - Wrong variable names are used > - Missing variable definitions > - Requiring multiple environment variables for GitHub Personal Access Token -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18102) [R] dplyr::count and dplyr::tally implementation return NA instead of 0
[ https://issues.apache.org/jira/browse/ARROW-18102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621425#comment-17621425 ] Weston Pace commented on ARROW-18102: - Supposedly both behaviors are useful (returning null is SQL standards compliant. See https://database.guide/sqlite-sum-vs-total-whats-the-difference/). I think we could add an option to the sum function. See also: https://github.com/substrait-io/substrait/issues/259 so I expect we will eventually need both behaviors. > [R] dplyr::count and dplyr::tally implementation return NA instead of 0 > --- > > Key: ARROW-18102 > URL: https://issues.apache.org/jira/browse/ARROW-18102 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0 >Reporter: Adam Black >Priority: Major > > I'm using dplyr with FileSystemDataset objects. The expected behavior is > similar (or the same as) dataframe behavior. When the FileSystemDataset has > zero rows dplyr::count and dplyr::tally return NA instead of 0. I would > expect the result to be 0. 
> > {code:r} > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > library(dplyr) > #> > #> Attaching package: 'dplyr' > #> The following objects are masked from 'package:stats': > #> > #> filter, lag > #> The following objects are masked from 'package:base': > #> > #> intersect, setdiff, setequal, union > path <- tempfile(fileext = ".feather") > zero_row_dataset <- cars %>% filter(dist < 0) > # expected behavior > zero_row_dataset %>% > count() > #> n > #> 1 0 > zero_row_dataset %>% > tally() > #> n > #> 1 0 > nrow(zero_row_dataset) > #> [1] 0 > # now test behavior with a FileSystemDataset > write_feather(zero_row_dataset, path) > ds <- open_dataset(path, format = "feather") > ds > #> FileSystemDataset with 1 Feather file > #> speed: double > #> dist: double > #> > #> See $metadata for additional Schema metadata > # actual behavior > ds %>% > count() %>% > collect() # incorrect result > #> # A tibble: 1 × 1 > #> n > #> > #> 1 NA > ds %>% > tally() %>% > collect() # incorrect result > #> # A tibble: 1 × 1 > #> n > #> > #> 1 NA > nrow(ds) # works as expected > #> [1] 0 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18118) [Release][Dev] 02-source.sh/03-binary-submit.sh didn't work for 10.0.0
Kouhei Sutou created ARROW-18118: Summary: [Release][Dev] 02-source.sh/03-binary-submit.sh didn't work for 10.0.0 Key: ARROW-18118 URL: https://issues.apache.org/jira/browse/ARROW-18118 Project: Apache Arrow Issue Type: Bug Components: Developer Tools Reporter: Kouhei Sutou Assignee: Kouhei Sutou - Wrong variable names are used - Missing variable definitions - Requiring multiple environment variables for GitHub Personal Access Token -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17207) [C++][CI] Occasional timeout failures on arrow-compute-scalar-test
[ https://issues.apache.org/jira/browse/ARROW-17207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17207: --- Labels: pull-request-available (was: ) > [C++][CI] Occasional timeout failures on arrow-compute-scalar-test > -- > > Key: ARROW-17207 > URL: https://issues.apache.org/jira/browse/ARROW-17207 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Raúl Cumplido >Assignee: Weston Pace >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Sometimes C++ tests fail due to a timeout on `arrow-compute-scalar-test`: > {code:java} > 30/85 Test #29: arrow-compute-scalar-test .***Timeout 300.04 > sec > Running arrow-compute-scalar-test, redirecting output into > /build/cpp/build/test-logs/arrow-compute-scalar-test.txt (attempt 1/1) {code} > Job failure example: > [https://github.com/ursacomputing/crossbow/runs/7511361872?check_suite_focus=true] > I've realized that even when it run successfully it takes around 4 minutes > (timeout is 5 minutes): > {code:java} > 32/85 Test #29: arrow-compute-scalar-test . Passed 229.77 > sec{code} > Should these tests be split? > {code:java} > add_arrow_compute_test(scalar_test > SOURCES > scalar_arithmetic_test.cc > scalar_boolean_test.cc > scalar_cast_test.cc > scalar_compare_test.cc > scalar_if_else_test.cc > scalar_nested_test.cc > scalar_random_test.cc > scalar_set_lookup_test.cc > scalar_string_test.cc > scalar_temporal_test.cc > scalar_validity_test.cc > test_util.cc) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-18117) [C++] Absl symbols not included in `arrow_bundled_dependencies`
[ https://issues.apache.org/jira/browse/ARROW-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-18117. -- Fix Version/s: 11.0.0 Resolution: Fixed Issue resolved by pull request 14465 [https://github.com/apache/arrow/pull/14465] > [C++] Absl symbols not included in `arrow_bundled_dependencies` > --- > > Key: ARROW-18117 > URL: https://issues.apache.org/jira/browse/ARROW-18117 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Yves Le Maout >Assignee: Yves Le Maout >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18117) [C++] Absl symbols not included in `arrow_bundled_dependencies`
[ https://issues.apache.org/jira/browse/ARROW-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-18117: - Summary: [C++] Absl symbols not included in `arrow_bundled_dependencies` (was: Absl symbols not included in `arrow_bundled_dependencies`) > [C++] Absl symbols not included in `arrow_bundled_dependencies` > --- > > Key: ARROW-18117 > URL: https://issues.apache.org/jira/browse/ARROW-18117 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Yves Le Maout >Assignee: Yves Le Maout >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17783) [C++] Aggregate kernel should not mandate alignment
[ https://issues.apache.org/jira/browse/ARROW-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621415#comment-17621415 ] David Li commented on ARROW-17783: -- I guess if you're going for 512 byte alignment then you'd have to deal with this anyways though. > [C++] Aggregate kernel should not mandate alignment > --- > > Key: ARROW-17783 > URL: https://issues.apache.org/jira/browse/ARROW-17783 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 6.0.0, 8.0.0 >Reporter: Yifei Yang >Assignee: Weston Pace >Priority: Major > Attachments: flight-alignment-test.zip > > > When using arrow's aggregate kernel with table transferred from arrow flight > (DoGet), it may crash at arrow::util::CheckAlignment(). However using > original data it works well, also if I first serialize the transferred table > into bytes then recreate an arrow table using the bytes, it works well. > "flight-alignment-test" attached is the minimal test that can produce the > issue, which basically does "sum(total_revenue) group by l_suppkey" using the > table from "DoGet()". ("DummyNode" is just used to be the producer of the > aggregate node as the producer is required to create the aggregate node) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17783) [C++] Aggregate kernel should not mandate alignment
[ https://issues.apache.org/jira/browse/ARROW-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621414#comment-17621414 ] David Li commented on ARROW-17783: -- It should also not be too bad to fix this in Flight (given gRPC generally forces a copy on us anyways); we would only lose the zero-copy in the case that the batch fits in a single gRPC slice (which is presumably relatively small, but I'd have to check what a typical size is). > [C++] Aggregate kernel should not mandate alignment > --- > > Key: ARROW-17783 > URL: https://issues.apache.org/jira/browse/ARROW-17783 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 6.0.0, 8.0.0 >Reporter: Yifei Yang >Assignee: Weston Pace >Priority: Major > Attachments: flight-alignment-test.zip > > > When using arrow's aggregate kernel with table transferred from arrow flight > (DoGet), it may crash at arrow::util::CheckAlignment(). However using > original data it works well, also if I first serialize the transferred table > into bytes then recreate an arrow table using the bytes, it works well. > "flight-alignment-test" attached is the minimal test that can produce the > issue, which basically does "sum(total_revenue) group by l_suppkey" using the > table from "DoGet()". ("DummyNode" is just used to be the producer of the > aggregate node as the producer is required to create the aggregate node) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18117) Absl symbols not included in `arrow_bundled_dependencies`
[ https://issues.apache.org/jira/browse/ARROW-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18117: --- Labels: pull-request-available (was: ) > Absl symbols not included in `arrow_bundled_dependencies` > - > > Key: ARROW-18117 > URL: https://issues.apache.org/jira/browse/ARROW-18117 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Yves Le Maout >Assignee: Yves Le Maout >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18117) Absl symbols not included in `arrow_bundled_dependencies`
[ https://issues.apache.org/jira/browse/ARROW-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yves Le Maout updated ARROW-18117: -- Component/s: C++ > Absl symbols not included in `arrow_bundled_dependencies` > - > > Key: ARROW-18117 > URL: https://issues.apache.org/jira/browse/ARROW-18117 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Yves Le Maout >Assignee: Yves Le Maout >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-18117) Absl symbols not included in `arrow_bundled_dependencies`
[ https://issues.apache.org/jira/browse/ARROW-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yves Le Maout reassigned ARROW-18117: - Assignee: Yves Le Maout > Absl symbols not included in `arrow_bundled_dependencies` > - > > Key: ARROW-18117 > URL: https://issues.apache.org/jira/browse/ARROW-18117 > Project: Apache Arrow > Issue Type: Bug >Reporter: Yves Le Maout >Assignee: Yves Le Maout >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18117) Absl symbols not included in `arrow_bundled_dependencies`
Yves Le Maout created ARROW-18117: - Summary: Absl symbols not included in `arrow_bundled_dependencies` Key: ARROW-18117 URL: https://issues.apache.org/jira/browse/ARROW-18117 Project: Apache Arrow Issue Type: Bug Reporter: Yves Le Maout -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17207) [C++][CI] Occasional timeout failures on arrow-compute-scalar-test
[ https://issues.apache.org/jira/browse/ARROW-17207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace reassigned ARROW-17207: --- Assignee: Weston Pace > [C++][CI] Occasional timeout failures on arrow-compute-scalar-test > -- > > Key: ARROW-17207 > URL: https://issues.apache.org/jira/browse/ARROW-17207 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Raúl Cumplido >Assignee: Weston Pace >Priority: Major > Fix For: 11.0.0 > > > Sometimes C++ tests fail due to a timeout on `arrow-compute-scalar-test`: > {code:java} > 30/85 Test #29: arrow-compute-scalar-test .***Timeout 300.04 > sec > Running arrow-compute-scalar-test, redirecting output into > /build/cpp/build/test-logs/arrow-compute-scalar-test.txt (attempt 1/1) {code} > Job failure example: > [https://github.com/ursacomputing/crossbow/runs/7511361872?check_suite_focus=true] > I've realized that even when it run successfully it takes around 4 minutes > (timeout is 5 minutes): > {code:java} > 32/85 Test #29: arrow-compute-scalar-test . Passed 229.77 > sec{code} > Should these tests be split? > {code:java} > add_arrow_compute_test(scalar_test > SOURCES > scalar_arithmetic_test.cc > scalar_boolean_test.cc > scalar_cast_test.cc > scalar_compare_test.cc > scalar_if_else_test.cc > scalar_nested_test.cc > scalar_random_test.cc > scalar_set_lookup_test.cc > scalar_string_test.cc > scalar_temporal_test.cc > scalar_validity_test.cc > test_util.cc) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17207) [C++][CI] Occasional timeout failures on arrow-compute-scalar-test
[ https://issues.apache.org/jira/browse/ARROW-17207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621404#comment-17621404 ] Weston Pace commented on ARROW-17207: - In this case I think I would prefer splitting. There is no real upper bound on the number of scalar kernels we might add and I don't think most of our scalar tests run in parallel so we might get some performance benefit (probably just on many-core dev machines) from splitting. I'll make a quick PR. I'll also check to see if any of them seem unreasonably slow. > [C++][CI] Occasional timeout failures on arrow-compute-scalar-test > -- > > Key: ARROW-17207 > URL: https://issues.apache.org/jira/browse/ARROW-17207 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Raúl Cumplido >Priority: Major > Fix For: 11.0.0 > > > Sometimes C++ tests fail due to a timeout on `arrow-compute-scalar-test`: > {code:java} > 30/85 Test #29: arrow-compute-scalar-test .***Timeout 300.04 > sec > Running arrow-compute-scalar-test, redirecting output into > /build/cpp/build/test-logs/arrow-compute-scalar-test.txt (attempt 1/1) {code} > Job failure example: > [https://github.com/ursacomputing/crossbow/runs/7511361872?check_suite_focus=true] > I've realized that even when it run successfully it takes around 4 minutes > (timeout is 5 minutes): > {code:java} > 32/85 Test #29: arrow-compute-scalar-test . Passed 229.77 > sec{code} > Should these tests be split? > {code:java} > add_arrow_compute_test(scalar_test > SOURCES > scalar_arithmetic_test.cc > scalar_boolean_test.cc > scalar_cast_test.cc > scalar_compare_test.cc > scalar_if_else_test.cc > scalar_nested_test.cc > scalar_random_test.cc > scalar_set_lookup_test.cc > scalar_string_test.cc > scalar_temporal_test.cc > scalar_validity_test.cc > test_util.cc) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17783) [C++] Aggregate kernel should not mandate alignment
[ https://issues.apache.org/jira/browse/ARROW-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621402#comment-17621402 ] Weston Pace commented on ARROW-17783: - My concern is less performance and more complexity and testing. I've made a proposal at ARROW-18115 which would address this (albeit this particular case would take a performance hit) > [C++] Aggregate kernel should not mandate alignment > --- > > Key: ARROW-17783 > URL: https://issues.apache.org/jira/browse/ARROW-17783 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 6.0.0, 8.0.0 >Reporter: Yifei Yang >Assignee: Weston Pace >Priority: Major > Attachments: flight-alignment-test.zip > > > When using arrow's aggregate kernel with table transferred from arrow flight > (DoGet), it may crash at arrow::util::CheckAlignment(). However using > original data it works well, also if I first serialize the transferred table > into bytes then recreate an arrow table using the bytes, it works well. > "flight-alignment-test" attached is the minimal test that can produce the > issue, which basically does "sum(total_revenue) group by l_suppkey" using the > table from "DoGet()". ("DummyNode" is just used to be the producer of the > aggregate node as the producer is required to create the aggregate node) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18113) Implement a read range process without caching
[ https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621401#comment-17621401 ] David Li commented on ARROW-18113: -- Just to be clear: to the filesystem, or on the reader itself? Also, I'm not clear on: "Multiple returned futures may correspond to a single read. Or, a single returned future may be a combined result of several individual reads." Isn't this saying the same thing twice? > Implement a read range process without caching > -- > > Key: ARROW-18113 > URL: https://issues.apache.org/jira/browse/ARROW-18113 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > > The current > [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100] > is mixing caching with coalescing and making difficult to implement readers > capable to really perform concurrent reads on coalesced data (see this > [github > comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for > additional context); for instance, right now the prebuffering feature of > those readers cannot handle concurrent invocations. > The goal for this ticket is to implement a similar component to > ReadRangeCache for performing non-cache reads (doing only the coalescing part > instead). So, once we have that new capability, we can port the parquet and > IPC readers to this new component and keep improving the reading process > (that would be part of other set of follow-up tickets). Similar ideas were > mentioned here https://issues.apache.org/jira/browse/ARROW-17599 > Maybe a good place to implement this new capability is inside the file system > abstraction (as part of a dedicated method to read coalesced data) and where > the abstract file system can provide a default implementation. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18113) Implement a read range process without caching
[ https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621398#comment-17621398 ] Weston Pace commented on ARROW-18113: - On reflection, I don't really prefer my automagic suggestion. I think an explicit multi-read API added to the filesystem would be a good way to go. I don't see it as an extension of ReadAsync though. Something like:
{noformat}
/// \brief Request multiple reads at once
///
/// The underlying filesystem may optimize these reads by coalescing small reads into
/// large reads or by breaking up large reads into multiple parallel smaller reads. The
/// reads should be issued in parallel if it makes sense for the filesystem.
///
/// One future will be returned for each input read range. Multiple returned futures
/// may correspond to a single read. Or, a single returned future may be a combined
/// result of several individual reads.
///
/// \param[in] ranges The ranges to read
/// \return A future that will complete when the data from the requested range is
/// available
virtual std::vector<Future<std::shared_ptr<Buffer>>> ReadManyAsync(
    const IOContext&, const std::vector<ReadRange>& ranges);
{noformat}
There could be a default implementation (perhaps relying on configurable protected min_hole_size_ and max_contiguous_read_size_ variables) so that filesystems would only need to provide a specialized alternative where it made sense. In the future it would be interesting to benchmark and see if [preadv|https://linux.die.net/man/2/preadv] can be used to provide a more optimized version for the local filesystem. 
I'd also be curious to know how an API like this could be adapted (or whether my proposal fits) for something like io_uring [~sakras] > Implement a read range process without caching > -- > > Key: ARROW-18113 > URL: https://issues.apache.org/jira/browse/ARROW-18113 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > > The current > [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100] > is mixing caching with coalescing and making difficult to implement readers > capable to really perform concurrent reads on coalesced data (see this > [github > comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for > additional context); for instance, right now the prebuffering feature of > those readers cannot handle concurrent invocations. > The goal for this ticket is to implement a similar component to > ReadRangeCache for performing non-cache reads (doing only the coalescing part > instead). So, once we have that new capability, we can port the parquet and > IPC readers to this new component and keep improving the reading process > (that would be part of other set of follow-up tickets). Similar ideas were > mentioned here https://issues.apache.org/jira/browse/ARROW-17599 > Maybe a good place to implement this new capability is inside the file system > abstraction (as part of a dedicated method to read coalesced data) and where > the abstract file system can provide a default implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18116) [R][Doc] correct paths for the read_parquet examples in cloud storage vignette
[ https://issues.apache.org/jira/browse/ARROW-18116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephanie Hazlitt updated ARROW-18116: -- Description: {{The S3 file paths don't run:}} {code:java} > library(arrow) > read_parquet(file = > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") Error in url(file, open = "rb") : URL scheme unsupported by this method{code} {{It looks like the file names are `part-0.parquet` not `data.parquet`.}} {{This runs:}} {code:java} read_parquet(file = "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code} was: {{The S3 file paths don't run:}} {code:java} > library(arrow) > read_parquet(file = > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") Error in url(file, open = "rb") : URL scheme unsupported by this method{code} {{It looks like the file names are `part-0.parquet` not `data.parquet`.}} {{This runs:}} {code:java} read_parquet(file = "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code} > [R][Doc] correct paths for the read_parquet examples in cloud storage vignette > -- > > Key: ARROW-18116 > URL: https://issues.apache.org/jira/browse/ARROW-18116 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, R >Reporter: Stephanie Hazlitt >Priority: Major > > {{The S3 file paths don't run:}} > {code:java} > > library(arrow) > > read_parquet(file = > > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") > Error in url(file, open = "rb") : URL scheme unsupported by this method{code} > {{It looks like the file names are `part-0.parquet` not `data.parquet`.}} > {{This runs:}} > {code:java} > read_parquet(file = > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18116) [R][Doc] correct paths for the read_parquet examples in cloud storage vignette
[ https://issues.apache.org/jira/browse/ARROW-18116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephanie Hazlitt updated ARROW-18116: -- Description: {{The S3 file paths don't run:}} {code:java} > library(arrow) > read_parquet(file = > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") Error in url(file, open = "rb") : URL scheme unsupported by this method{code} {{It looks like the file names are `part-0.parquet` not `data.parquet`.}} {{This runs:}} {code:java} read_parquet(file = "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code} was: {{The S3 file paths don't run:}} {{}} {code:java} > library(arrow) > read_parquet(file = > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") Error in url(file, open = "rb") : URL scheme unsupported by this method{code} {{It looks like the file names are `part-0.parquet` not `data.parquet`.}} {{This runs:}} {{}} {code:java} read_parquet(file = "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code} {{}} > [R][Doc] correct paths for the read_parquet examples in cloud storage vignette > -- > > Key: ARROW-18116 > URL: https://issues.apache.org/jira/browse/ARROW-18116 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, R >Reporter: Stephanie Hazlitt >Priority: Major > > {{The S3 file paths don't run:}} > > {code:java} > > library(arrow) > > read_parquet(file = > > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") > Error in url(file, open = "rb") : URL scheme unsupported by this method{code} > {{It looks like the file names are `part-0.parquet` not `data.parquet`.}} > {{This runs:}} > {code:java} > read_parquet(file = > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18116) [R][Doc] correct paths for the read_parquet examples in cloud storage vignette
Stephanie Hazlitt created ARROW-18116: - Summary: [R][Doc] correct paths for the read_parquet examples in cloud storage vignette Key: ARROW-18116 URL: https://issues.apache.org/jira/browse/ARROW-18116 Project: Apache Arrow Issue Type: Bug Components: Documentation, R Reporter: Stephanie Hazlitt {{The S3 file paths don't run:}} {{}} {code:java} > library(arrow) > read_parquet(file = > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") Error in url(file, open = "rb") : URL scheme unsupported by this method{code} {{It looks like the file names are `part-0.parquet` not `data.parquet`.}} {{This runs:}} {{}} {code:java} read_parquet(file = "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code} {{}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18115) [C++] Acero buffer alignment
[ https://issues.apache.org/jira/browse/ARROW-18115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621395#comment-17621395 ] Weston Pace commented on ARROW-18115: - CC [~apitrou][~sakras][~michalno][~marsupialtail][~bkietz] and maybe [~rtpsw][~icexelloss] would be interested. > [C++] Acero buffer alignment > > > Key: ARROW-18115 > URL: https://issues.apache.org/jira/browse/ARROW-18115 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > This is a general JIRA to centralize some discussion / proposal on a buffer > alignment strategy for Acero based on discussions that have happened in a few > different contexts. Any actual work will probably span multiple JIRAs, some > of which are already in progress. > Motivations: > * Incoming data may not be aligned at all. Some kernel functions and parts > of Acero (e.g. aggregation, join) use explicit SIMD instructions (e.g. > intrinsics) and will fail / cause corruption if care is not taken to use > unaligned loads (e.g. ARROW-17783). There are parts of the code that assume > fixed arrays with size T are at least T aligned. This is generally a safe > assumption for data generated by arrow-c++ (which allocates all buffers with > 64 byte alignment) but then leads to errors when processing data from flight. > * Dataset writes and mid-plan spilling both can benefit form direct I/O, less > for performance reasons and more for memory management reasons. However, in > order to use direct I/O a buffer needs to be aligned, typically to 512 bytes. > This is larger than the current minimum alignment requirements. > Proposal: > * Allow the minimum alignment of a memory pool to be configurable. This is > similar to the proposal of ARROW-17836 but does not require much of an API > change (at the expense of being slightly less flexible). > * Add a capability to realign a buffer/array/batch/table to a target > alignment. 
This would only modify buffers that are not already aligned. > Basically, given an arrow object, and a target memory pool, migrate a buffer > to the target memory pool if its alignment is insufficient. > * Acero, in the source node, forces all buffers to be 64 byte aligned. This > way the internals of Acero don't have to worry about this case. This > introduces a performance penalty when buffers are not aligned but is much > simpler to maintain and test than trying to support any random buffer. To > avoid this penalty it would be simpler to avoid the unaligned buffers in the > first place. > * Acero requires a memory pool that has 512-byte alignment so that > Acero-allocated buffers can use direct I/O. If the default memory pool does > not have 512-byte alignment then Acero can use a per-plan pool. This covers > the common case for spilling and dataset writing which is that we are > partitioning prior to the write and so we are writing Acero-allocated buffers > anyways. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18115) [C++] Acero buffer alignment
Weston Pace created ARROW-18115: --- Summary: [C++] Acero buffer alignment Key: ARROW-18115 URL: https://issues.apache.org/jira/browse/ARROW-18115 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace This is a general JIRA to centralize some discussion / proposal on a buffer alignment strategy for Acero based on discussions that have happened in a few different contexts. Any actual work will probably span multiple JIRAs, some of which are already in progress. Motivations: * Incoming data may not be aligned at all. Some kernel functions and parts of Acero (e.g. aggregation, join) use explicit SIMD instructions (e.g. intrinsics) and will fail / cause corruption if care is not taken to use unaligned loads (e.g. ARROW-17783). There are parts of the code that assume fixed arrays with size T are at least T aligned. This is generally a safe assumption for data generated by arrow-c++ (which allocates all buffers with 64 byte alignment) but then leads to errors when processing data from flight. * Dataset writes and mid-plan spilling both can benefit from direct I/O, less for performance reasons and more for memory management reasons. However, in order to use direct I/O a buffer needs to be aligned, typically to 512 bytes. This is larger than the current minimum alignment requirements. Proposal: * Allow the minimum alignment of a memory pool to be configurable. This is similar to the proposal of ARROW-17836 but does not require much of an API change (at the expense of being slightly less flexible). * Add a capability to realign a buffer/array/batch/table to a target alignment. This would only modify buffers that are not already aligned. Basically, given an arrow object, and a target memory pool, migrate a buffer to the target memory pool if its alignment is insufficient. * Acero, in the source node, forces all buffers to be 64 byte aligned. This way the internals of Acero don't have to worry about this case. 
This introduces a performance penalty when buffers are not aligned but is much simpler to maintain and test than trying to support any random buffer. To avoid this penalty it would be simpler to avoid the unaligned buffers in the first place. * Acero requires a memory pool that has 512-byte alignment so that Acero-allocated buffers can use direct I/O. If the default memory pool does not have 512-byte alignment then Acero can use a per-plan pool. This covers the common case for spilling and dataset writing which is that we are partitioning prior to the write and so we are writing Acero-allocated buffers anyways. -- This message was sent by Atlassian Jira (v8.20.10#820010)
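The alignment tests and realignment described in the proposal reduce to simple address arithmetic. A minimal plain-Python sketch of that arithmetic (illustrative only; the real logic belongs in Arrow C++'s memory pool, and the function names here are hypothetical):

```python
def is_aligned(address: int, alignment: int) -> bool:
    """True if `address` is a multiple of `alignment` (assumed a power of two)."""
    return address % alignment == 0

def align_up(address: int, alignment: int) -> int:
    """Round `address` up to the next multiple of `alignment`."""
    return (address + alignment - 1) & ~(alignment - 1)

# A source node following the proposal would test each incoming buffer:
# if not is_aligned(buffer_address, 64), copy the buffer into a
# 64-byte-aligned allocation before handing it to Acero's SIMD kernels.
# Direct I/O would use the stricter is_aligned(buffer_address, 512).
```

The same round-up formula is what an alignment-configurable memory pool would apply when carving aligned allocations out of a raw region.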
[jira] [Commented] (ARROW-17599) [C++] ReadRangeCache should not retain data after read
[ https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621368#comment-17621368 ] Percy Camilo Triveño Aucahuasi commented on ARROW-17599: Another follow up/related ticket: https://issues.apache.org/jira/browse/ARROW-18113 > [C++] ReadRangeCache should not retain data after read > -- > > Key: ARROW-17599 > URL: https://issues.apache.org/jira/browse/ARROW-17599 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > Labels: good-second-issue, pull-request-available > Time Spent: 4h > Remaining Estimate: 0h > > I've added a unit test of the issue here: > https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention > We use the ReadRangeCache for pre-buffering IPC and parquet files. Sometimes > those files are quite large (gigabytes). The usage is roughly: > for X in num_row_groups: > CacheAllThePiecesWeNeedForRowGroupX > WaitForPiecesToArriveForRowGroupX > ReadThePiecesWeNeedForRowGroupX > However, once we've read in row group X and passed it on to Acero, etc. we do > not release the data for row group X. The read range cache's entries vector > still holds a pointer to the buffer. The data is not released until the file > reader itself is destroyed which only happens when we have finished > processing an entire file. > This leads to excessive memory usage when pre-buffering is enabled. > This could potentially be a little difficult to implement because a single > read range's cache entry could be shared by multiple ranges so we will need > some kind of reference counting to know when we have fully finished with an > entry and can release it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18113) Implement a read range process without caching
[ https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621309#comment-17621309 ] David Li commented on ARROW-18113: -- (Of course, that's only if you choose to go with an explicit API, vs Weston's suggestion of possibly doing it 'automagically') > Implement a read range process without caching > -- > > Key: ARROW-18113 > URL: https://issues.apache.org/jira/browse/ARROW-18113 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > > The current > [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100] > mixes caching with coalescing, making it difficult to implement readers > capable of truly performing concurrent reads on coalesced data (see this > [github > comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for > additional context); for instance, right now the prebuffering feature of > those readers cannot handle concurrent invocations. > The goal for this ticket is to implement a component similar to > ReadRangeCache that performs non-cached reads (doing only the coalescing part > instead). So, once we have that new capability, we can port the parquet and > IPC readers to this new component and keep improving the reading process > (that would be part of another set of follow-up tickets). Similar ideas were > mentioned here https://issues.apache.org/jira/browse/ARROW-17599 > Maybe a good place to implement this new capability is inside the file system > abstraction (as part of a dedicated method to read coalesced data), where > the abstract file system can provide a default implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18113) Implement a read range process without caching
[ https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621308#comment-17621308 ] David Li commented on ARROW-18113: -- I'd like to raise my comments in ARROW-17913 and ARROW-17917 as well, especially https://issues.apache.org/jira/browse/ARROW-17913?focusedCommentId=17614155=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17614155 . Note Weston offers suggestions there too about how we might handle this. The API would probably be an extension of RandomAccessFile::ReadAsync. The file system would come into play by returning a subclass of RAF that overrides the new method and does coalescing appropriate to the underlying device > Implement a read range process without caching > -- > > Key: ARROW-18113 > URL: https://issues.apache.org/jira/browse/ARROW-18113 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > > The current > [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100] > is mixing caching with coalescing and making difficult to implement readers > capable to really perform concurrent reads on coalesced data (see this > [github > comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for > additional context); for instance, right now the prebuffering feature of > those readers cannot handle concurrent invocations. > The goal for this ticket is to implement a similar component to > ReadRangeCache for performing non-cache reads (doing only the coalescing part > instead). So, once we have that new capability, we can port the parquet and > IPC readers to this new component and keep improving the reading process > (that would be part of other set of follow-up tickets). 
Similar ideas were > mentioned here https://issues.apache.org/jira/browse/ARROW-17599 > Maybe a good place to implement this new capability is inside the file system > abstraction (as part of a dedicated method to read coalesced data) and where > the abstract file system can provide a default implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
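The coalescing half that this ticket wants to split out of `ReadRangeCache` merges nearby byte ranges so that many small reads become a few large ones. A minimal sketch of that merging step (pure Python; the parameter names mirror Arrow C++'s `CacheOptions` but the function itself is illustrative):

```python
def coalesce_ranges(ranges, hole_size_limit=8192, range_size_limit=32 << 20):
    """Merge (offset, length) byte ranges whose gap is at most
    `hole_size_limit` bytes, without letting any merged range grow
    beyond `range_size_limit` bytes. Returns the coalesced ranges."""
    out = []
    for off, length in sorted(ranges):
        if out:
            cur_off, cur_len = out[-1]
            gap = off - (cur_off + cur_len)
            merged_len = max(cur_off + cur_len, off + length) - cur_off
            if gap <= hole_size_limit and merged_len <= range_size_limit:
                # Small enough hole: extend the previous range over it.
                out[-1] = (cur_off, merged_len)
                continue
        out.append((off, length))
    return out
```

A non-caching reader built on this would issue one concurrent I/O per coalesced range and then slice each requested sub-range out of the returned buffer, with no retained cache entries.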
[jira] [Created] (ARROW-18114) [R] unify_schemas=FALSE does not improve open_dataset() read times
Carl Boettiger created ARROW-18114: -- Summary: [R] unify_schemas=FALSE does not improve open_dataset() read times Key: ARROW-18114 URL: https://issues.apache.org/jira/browse/ARROW-18114 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Carl Boettiger open_dataset() provides the very helpful optional argument to set unify_schemas=FALSE, which should allow arrow to inspect a single parquet file instead of touching potentially thousands or more parquet files to determine a consistent unified schema. This ought to provide a substantial performance increase in contexts where the schema is known in advance. Unfortunately, in my tests it seems to have no impact on performance. Consider the following reprexes: default, unify_schemas=TRUE library(arrow) ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", endpoint_override = "data.ecoforecast.org", anonymous=TRUE) bench::bench_time({ open_dataset(ex) }) about 32 seconds for me. manual, unify_schemas=FALSE: bench::bench_time({ open_dataset(ex, unify_schemas = FALSE) }) takes about 32 seconds as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18113) Implement a read range process without caching
Percy Camilo Triveño Aucahuasi created ARROW-18113: -- Summary: Implement a read range process without caching Key: ARROW-18113 URL: https://issues.apache.org/jira/browse/ARROW-18113 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Percy Camilo Triveño Aucahuasi Assignee: Percy Camilo Triveño Aucahuasi The current [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100] mixes caching with coalescing, making it difficult to implement readers capable of truly performing concurrent reads on coalesced data (see this [github comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for additional context); for instance, right now the prebuffering feature of those readers cannot handle concurrent invocations. The goal for this ticket is to implement a component similar to ReadRangeCache that performs non-cached reads (doing only the coalescing part instead). So, once we have that new capability, we can port the parquet and IPC readers to this new component and keep improving the reading process (that would be part of another set of follow-up tickets). Similar ideas were mentioned here: https://issues.apache.org/jira/browse/ARROW-17599 Maybe a good place to implement this new capability is inside the file system abstraction (as part of a dedicated method to read coalesced data), where the abstract file system can provide a default implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15641) [C++][Python] UDF Aggregate Function Implementation
[ https://issues.apache.org/jira/browse/ARROW-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621290#comment-17621290 ] Apache Arrow JIRA Bot commented on ARROW-15641: --- This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [C++][Python] UDF Aggregate Function Implementation > --- > > Key: ARROW-15641 > URL: https://issues.apache.org/jira/browse/ARROW-15641 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > Here we will be implementing the UDF support for aggregate functions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16041) [C++][Python] Include UDFOptions
[ https://issues.apache.org/jira/browse/ARROW-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621291#comment-17621291 ] Apache Arrow JIRA Bot commented on ARROW-16041: --- This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [C++][Python] Include UDFOptions > > > Key: ARROW-16041 > URL: https://issues.apache.org/jira/browse/ARROW-16041 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > In the first stage of the development, we do not support function options to > be taken from Python UDFs. But this is a feature that is required for > advanced users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15646) [C++][Python] UDF Vector Function Implementation
[ https://issues.apache.org/jira/browse/ARROW-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621288#comment-17621288 ] Apache Arrow JIRA Bot commented on ARROW-15646: --- This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [C++][Python] UDF Vector Function Implementation > - > > Key: ARROW-15646 > URL: https://issues.apache.org/jira/browse/ARROW-15646 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > Here we will implement the vector functions for UDFs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15641) [C++][Python] UDF Aggregate Function Implementation
[ https://issues.apache.org/jira/browse/ARROW-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-15641: - Assignee: (was: Vibhatha Lakmal Abeykoon) > [C++][Python] UDF Aggregate Function Implementation > --- > > Key: ARROW-15641 > URL: https://issues.apache.org/jira/browse/ARROW-15641 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Vibhatha Lakmal Abeykoon >Priority: Major > > Here we will be implementing the UDF support for aggregate functions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15637) [C++][Python] UDF Optimizations
[ https://issues.apache.org/jira/browse/ARROW-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-15637: - Assignee: (was: Vibhatha Lakmal Abeykoon) > [C++][Python] UDF Optimizations > --- > > Key: ARROW-15637 > URL: https://issues.apache.org/jira/browse/ARROW-15637 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Vibhatha Lakmal Abeykoon >Priority: Minor > > Need an interface to evaluate the memory footprint, execution time and health > of the UDFs and return a meaningful status ex: > `Status::HighMemoryUsageException()`, `Status::TimeLimitException()` > Note: This is also aligned with resource monitoring in the parallel execution > space. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16041) [C++][Python] Include UDFOptions
[ https://issues.apache.org/jira/browse/ARROW-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-16041: - Assignee: (was: Vibhatha Lakmal Abeykoon) > [C++][Python] Include UDFOptions > > > Key: ARROW-16041 > URL: https://issues.apache.org/jira/browse/ARROW-16041 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Vibhatha Lakmal Abeykoon >Priority: Major > > In the first stage of the development, we do not support function options to > be taken from Python UDFs. But this is a feature that is required for > advanced users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15646) [C++][Python] UDF Vector Function Implementation
[ https://issues.apache.org/jira/browse/ARROW-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-15646: - Assignee: (was: Vibhatha Lakmal Abeykoon) > [C++][Python] UDF Vector Function Implementation > - > > Key: ARROW-15646 > URL: https://issues.apache.org/jira/browse/ARROW-15646 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Vibhatha Lakmal Abeykoon >Priority: Major > > Here we will implement the vector functions for UDFs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15637) [C++][Python] UDF Optimizations
[ https://issues.apache.org/jira/browse/ARROW-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621289#comment-17621289 ] Apache Arrow JIRA Bot commented on ARROW-15637: --- This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [C++][Python] UDF Optimizations > --- > > Key: ARROW-15637 > URL: https://issues.apache.org/jira/browse/ARROW-15637 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Minor > > Need an interface to evaluate the memory footprint, execution time and health > of the UDFs and return a meaningful status ex: > `Status::HighMemoryUsageException()`, `Status::TimeLimitException()` > Note: This is also aligned with resource monitoring in the parallel execution > space. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17711) [Go] RLE Arrays To/From JSON
[ https://issues.apache.org/jira/browse/ARROW-17711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol closed ARROW-17711. - Resolution: Duplicate > [Go] RLE Arrays To/From JSON > > > Key: ARROW-17711 > URL: https://issues.apache.org/jira/browse/ARROW-17711 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"
[ https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621277#comment-17621277 ] Ben Harkins commented on ARROW-18106: - That is indeed unexpected... especially since it comes back as a plain string in the first case. I suspect it's an issue with timestamps specifically (or potentially any non-string type with a json string representation). Test coverage seems to be lacking in this area. I'll take a look at it. > [C++] JSON reader ignores explicit schema with default > unexpected_field_behavior="infer" > > > Key: ARROW-18106 > URL: https://issues.apache.org/jira/browse/ARROW-18106 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Harkins >Priority: Major > Labels: json > > Not 100% sure this is a "bug", but at least I find it an unexpected interplay > between two options. > By default, when reading json, we _infer_ the data type of columns, and when > specifying an explicit schema, we _also_ by default infer the type of columns > that are not specified in the explicit schema. The docs for > {{unexpected_field_behavior}}: > > How JSON fields outside of explicit_schema (if given) are treated > But it seems that if you specify a schema, and the parsing of one of the > columns fails according to that schema, we still fall back to this default of > inferring the data type (while I would have expected an error, since we > should only infer for columns _not_ in the schema. 
> Example code using pyarrow: > {code:python} > import io > import pyarrow as pa > from pyarrow import json > s_json = """{"column":"2022-09-05T08:08:46.000"}""" > opts = json.ParseOptions(explicit_schema=pa.schema([("column", > pa.timestamp("s"))])) > json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) > {code} > The parsing fails here because there are milliseconds and the type is "s", > but the explicit schema is ignored, and we get a result with a string column > as result: > {code} > pyarrow.Table > column: string > > column: [["2022-09-05T08:08:46.000"]] > {code} > But when adding {{unexpected_field_behaviour="ignore"}}, we actually get the > expected parse error: > {code:python} > opts = json.ParseOptions(explicit_schema=pa.schema([("column", > pa.timestamp("s"))]), unexpected_field_behavior="ignore") > json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) > {code} > gives > {code} > ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't > parse:2022-09-05T08:08:46.000 > {code} > It might be this is specific to timestamps, I don't directly see a similar > issue with eg {{"column": "A"}} and setting the schema to "column" being > int64. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18112) [Go] Remaining Scalar Unary Arithmetic (sin/cos/etc. rounding, log/ln, etc.)
Matthew Topol created ARROW-18112: - Summary: [Go] Remaining Scalar Unary Arithmetic (sin/cos/etc. rounding, log/ln, etc.) Key: ARROW-18112 URL: https://issues.apache.org/jira/browse/ARROW-18112 Project: Apache Arrow Issue Type: Sub-task Components: Go Reporter: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18110) [Go] Scalar Comparisons
Matthew Topol created ARROW-18110: - Summary: [Go] Scalar Comparisons Key: ARROW-18110 URL: https://issues.apache.org/jira/browse/ARROW-18110 Project: Apache Arrow Issue Type: Sub-task Reporter: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18111) [Go] Remaining Scalar Binary Arithmetic (bitwise, shifts)
Matthew Topol created ARROW-18111: - Summary: [Go] Remaining Scalar Binary Arithmetic (bitwise, shifts) Key: ARROW-18111 URL: https://issues.apache.org/jira/browse/ARROW-18111 Project: Apache Arrow Issue Type: Sub-task Components: Go Reporter: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18109) [Go] Scalar Unary Arithmetic (abs, neg, sqrt, sign)
[ https://issues.apache.org/jira/browse/ARROW-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol updated ARROW-18109: -- Summary: [Go] Scalar Unary Arithmetic (abs, neg, sqrt, sign) (was: [Go] Scalar Unary Arithmetic) > [Go] Scalar Unary Arithmetic (abs, neg, sqrt, sign) > --- > > Key: ARROW-18109 > URL: https://issues.apache.org/jira/browse/ARROW-18109 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go >Reporter: Matthew Topol >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18109) [Go] Scalar Unary Arithmetic
Matthew Topol created ARROW-18109: - Summary: [Go] Scalar Unary Arithmetic Key: ARROW-18109 URL: https://issues.apache.org/jira/browse/ARROW-18109 Project: Apache Arrow Issue Type: Sub-task Components: Go Reporter: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18108) [Go] More Scalar Binary Arithmetic (Multiply & Divide)
Matthew Topol created ARROW-18108: - Summary: [Go] More Scalar Binary Arithmetic (Multiply & Divide) Key: ARROW-18108 URL: https://issues.apache.org/jira/browse/ARROW-18108 Project: Apache Arrow Issue Type: Sub-task Components: Go Reporter: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"
[ https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Harkins reassigned ARROW-18106: --- Assignee: Ben Harkins > [C++] JSON reader ignores explicit schema with default > unexpected_field_behavior="infer" > > > Key: ARROW-18106 > URL: https://issues.apache.org/jira/browse/ARROW-18106 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Harkins >Priority: Major > Labels: json > > Not 100% sure this is a "bug", but at least I find it an unexpected interplay > between two options. > By default, when reading json, we _infer_ the data type of columns, and when > specifying an explicit schema, we _also_ by default infer the type of columns > that are not specified in the explicit schema. The docs for > {{unexpected_field_behavior}}: > > How JSON fields outside of explicit_schema (if given) are treated > But it seems that if you specify a schema, and the parsing of one of the > columns fails according to that schema, we still fall back to this default of > inferring the data type (while I would have expected an error, since we > should only infer for columns _not_ in the schema. 
> Example code using pyarrow: > {code:python} > import io > import pyarrow as pa > from pyarrow import json > s_json = """{"column":"2022-09-05T08:08:46.000"}""" > opts = json.ParseOptions(explicit_schema=pa.schema([("column", > pa.timestamp("s"))])) > json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) > {code} > The parsing fails here because there are milliseconds and the type is "s", > but the explicit schema is ignored, and we get a result with a string column > as result: > {code} > pyarrow.Table > column: string > > column: [["2022-09-05T08:08:46.000"]] > {code} > But when adding {{unexpected_field_behaviour="ignore"}}, we actually get the > expected parse error: > {code:python} > opts = json.ParseOptions(explicit_schema=pa.schema([("column", > pa.timestamp("s"))]), unexpected_field_behavior="ignore") > json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) > {code} > gives > {code} > ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't > parse:2022-09-05T08:08:46.000 > {code} > It might be this is specific to timestamps, I don't directly see a similar > issue with eg {{"column": "A"}} and setting the schema to "column" being > int64. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16503) [C++] Can't concatenate extension arrays
[ https://issues.apache.org/jira/browse/ARROW-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16503: --- Labels: good-second-issue pull-request-available (was: good-second-issue) > [C++] Can't concatenate extension arrays > > > Key: ARROW-16503 > URL: https://issues.apache.org/jira/browse/ARROW-16503 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Dewey Dunnington >Assignee: Joris Van den Bossche >Priority: Major > Labels: good-second-issue, pull-request-available > Fix For: 11.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > It looks like Arrays with an extension type can't be concatenated. From the R > bindings: > {code:R} > library(arrow, warn.conflicts = FALSE) > arr <- vctrs_extension_array(1:10) > concat_arrays(arr, arr) > #> Error: NotImplemented: concatenation of integer(0) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195 > VisitTypeInline(*out_->type, this) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590 > ConcatenateImpl(data, pool).Concatenate(_data) > {code} > This shows up more practically when using the query engine: > {code:R} > library(arrow, warn.conflicts = FALSE) > table <- arrow_table( > group = rep(c("a", "b"), 5), > col1 = 1:10, > col2 = vctrs_extension_array(1:10) > ) > tf <- tempfile() > table |> dplyr::group_by(group) |> write_dataset(tf) > open_dataset(tf) |> > dplyr::arrange(col1) |> > dplyr::collect() > #> Error in `dplyr::collect()`: > #> ! 
NotImplemented: concatenation of extension > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195 > VisitTypeInline(*out_->type, this) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590 > ConcatenateImpl(data, pool).Concatenate(_data) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025 > Concatenate(values.chunks(), ctx->memory_pool()) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084 > TakeCA(*table.column(j), indices, options, ctx) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:527 > impl_->DoFinish() > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:467 > iterator_.Next() > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337 > ReadNext() > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351 > ToRecordBatches() > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-18081) [Go] Add Scalar Boolean Functions
[ https://issues.apache.org/jira/browse/ARROW-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-18081. --- Resolution: Fixed Issue resolved by pull request 14442 [https://github.com/apache/arrow/pull/14442] > [Go] Add Scalar Boolean Functions > - > > Key: ARROW-18081 > URL: https://issues.apache.org/jira/browse/ARROW-18081 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17954) [R] Update News for 10.0.0
[ https://issues.apache.org/jira/browse/ARROW-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-17954: Fix Version/s: 10.0.0 (was: 11.0.0) > [R] Update News for 10.0.0 > -- > > Key: ARROW-17954 > URL: https://issues.apache.org/jira/browse/ARROW-17954 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Will Jones >Assignee: Will Jones >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17954) [R] Update News for 10.0.0
[ https://issues.apache.org/jira/browse/ARROW-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-17954. - Fix Version/s: 11.0.0 (was: 10.0.0) Resolution: Fixed Issue resolved by pull request 14337 [https://github.com/apache/arrow/pull/14337] > [R] Update News for 10.0.0 > -- > > Key: ARROW-17954 > URL: https://issues.apache.org/jira/browse/ARROW-17954 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Will Jones >Assignee: Will Jones >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17871) [Go] Implement Initial Scalar Binary Arithmetic Infrastructure
[ https://issues.apache.org/jira/browse/ARROW-17871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol updated ARROW-17871: -- Fix Version/s: 11.0.0 > [Go] Implement Initial Scalar Binary Arithmetic Infrastructure > -- > > Key: ARROW-17871 > URL: https://issues.apache.org/jira/browse/ARROW-17871 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 11h > Remaining Estimate: 0h > > Uses add, add_checked, sub, and sub_checked as the initial implementation, > only for integral types and float32/float64. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17871) [Go] Implement Initial Scalar Binary Arithmetic Infrastructure
[ https://issues.apache.org/jira/browse/ARROW-17871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-17871. --- Resolution: Fixed Issue resolved by pull request 14255 [https://github.com/apache/arrow/pull/14255] > [Go] Implement Initial Scalar Binary Arithmetic Infrastructure > -- > > Key: ARROW-17871 > URL: https://issues.apache.org/jira/browse/ARROW-17871 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > Uses add, add_checked, sub, and sub_checked as the initial implementation, > only for integral types and float32/float64. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17192) [Python] .to_pandas can't read_feather if a date column contains dates before 1677 and after 2262
[ https://issues.apache.org/jira/browse/ARROW-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621174#comment-17621174 ] Alenka Frim commented on ARROW-17192: - This is a known issue: pandas currently supports the {{datetime64}} data type only in nanosecond resolution. So when you write to a feather file, the pandas dataframe gets converted to an arrow table, and the conversion infers the datetime to microsecond resolution. As a workaround you can use {{feather.read_table}} to read the feather file into an Arrow table and then use {{to_pandas}} to convert it into a pandas dataframe, but you will have to add the {{timestamp_as_object=True}} keyword so that PyArrow doesn't try to convert the timestamp to {{{}datetime64[ns]{}}}: {code:python} >>> feather.read_table("to_trash.feather").to_pandas(timestamp_as_object=True) date 0 1654-01-01 00:00:00 1 1920-01-01 00:00:00 {code} But I think we should still pass through {{**kwargs}} in {{read_feather}} to {{to_pandas()}} so that one could specify the {{timestamp_as_object=True}} keyword there as well. So I am keeping the Jira open and will try to make a PR for it in the following week. Contributions are also welcome; I can help if needed. > [Python] .to_pandas can't read_feather if a date column contains dates > before 1677 and after 2262 > -- > > Key: ARROW-17192 > URL: https://issues.apache.org/jira/browse/ARROW-17192 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Environment: Any environment >Reporter: Adrien Pacifico >Priority: Major > > A feather file with a column containing dates lower than 1677 or greater than > 2262 cannot be read with pandas, due to the `.to_pandas` method. 
> To reproduce the issue: > {code:python} > ### create feather file > import pandas as pd > from datetime import datetime > df = pd.DataFrame({"date": [ > datetime.fromisoformat("1654-01-01"), > datetime.fromisoformat("1920-01-01"), > ],}) > df.to_feather("to_trash.feather") > ### read feather file > from pyarrow.feather import read_feather > read_feather("to_trash.feather") > {code} > > I think that the expected behavior would be to have an object column > containing datetime objects. > I think that the problem comes from the _array_like_to_pandas method: > [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L1584] > or from `_to_pandas()` > [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L2742] > or from `to_pandas`: > [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L673] -- This message was sent by Atlassian Jira (v8.20.10#820010)
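The "1677 / 2262" bounds in the thread above come from the range of pandas' nanosecond-resolution timestamps. A stdlib-only sketch (no pandas or pyarrow involved) shows why a date such as 1654-01-01 cannot be represented as {{datetime64[ns]}}:

```python
from datetime import datetime, timedelta, timezone

# datetime64[ns] stores a signed 64-bit count of nanoseconds relative to the
# Unix epoch, so the representable window is roughly epoch +/- 2**63 ns.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
half_range = timedelta(microseconds=2**63 // 1000)  # drop sub-microsecond part

lower, upper = EPOCH - half_range, EPOCH + half_range
print(lower.year, upper.year)  # 1677 2262
```

Dates outside this window (like 1654-01-01 above) overflow the nanosecond representation, which is why {{timestamp_as_object=True}} (plain {{datetime}} objects) is needed.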
[jira] [Commented] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"
[ https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621158#comment-17621158 ] Joris Van den Bossche commented on ARROW-18106: --- cc [~benpharkins] > [C++] JSON reader ignores explicit schema with default > unexpected_field_behavior="infer" > > > Key: ARROW-18106 > URL: https://issues.apache.org/jira/browse/ARROW-18106 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: json > > Not 100% sure this is a "bug", but at least I find it an unexpected interplay > between two options. > By default, when reading JSON, we _infer_ the data type of columns, and when > specifying an explicit schema, we _also_ by default infer the type of columns > that are not specified in the explicit schema. The docs for > {{unexpected_field_behavior}}: > > How JSON fields outside of explicit_schema (if given) are treated > But it seems that if you specify a schema, and the parsing of one of the > columns fails according to that schema, we still fall back to this default of > inferring the data type (while I would have expected an error, since we > should only infer for columns _not_ in the schema). 
> Example code using pyarrow: > {code:python} > import io > import pyarrow as pa > from pyarrow import json > s_json = """{"column":"2022-09-05T08:08:46.000"}""" > opts = json.ParseOptions(explicit_schema=pa.schema([("column", > pa.timestamp("s"))])) > json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) > {code} > The parsing fails here because there are milliseconds and the type is "s", > but the explicit schema is ignored, and we get a string column as the > result: > {code} > pyarrow.Table > column: string > > column: [["2022-09-05T08:08:46.000"]] > {code} > But when adding {{unexpected_field_behavior="ignore"}}, we actually get the > expected parse error: > {code:python} > opts = json.ParseOptions(explicit_schema=pa.schema([("column", > pa.timestamp("s"))]), unexpected_field_behavior="ignore") > json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) > {code} > gives > {code} > ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't > parse:2022-09-05T08:08:46.000 > {code} > It might be that this is specific to timestamps; I don't directly see a similar > issue with e.g. {{"column": "A"}} and setting the schema to "column" being > int64. -- This message was sent by Atlassian Jira (v8.20.10#820010)
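The semantics the reporter expects can be modeled in a few lines of plain Python (a hypothetical sketch of the desired behavior, not the actual C++ reader): a field listed in {{explicit_schema}} always keeps its schema type, and only fields *absent* from the schema are governed by {{unexpected_field_behavior}}.

```python
def resolve_type(name, value, explicit_schema, unexpected_field_behavior="infer"):
    """Return the type a JSON field should get under the expected semantics."""
    if name in explicit_schema:
        # A field in the explicit schema keeps the schema type; a value that
        # cannot be converted should raise, never fall back to inference.
        return explicit_schema[name]
    # Only fields outside the schema are subject to unexpected_field_behavior.
    if unexpected_field_behavior == "infer":
        return type(value).__name__  # stand-in for real type inference
    if unexpected_field_behavior == "ignore":
        return None                  # the field is dropped
    raise ValueError(f"unexpected field: {name}")
```

Under this model, the example above would surface the timestamp parse error regardless of {{unexpected_field_behavior}}, since "column" is present in the schema.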
[jira] [Assigned] (ARROW-16503) [C++] Can't concatenate extension arrays
[ https://issues.apache.org/jira/browse/ARROW-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-16503: - Assignee: Joris Van den Bossche > [C++] Can't concatenate extension arrays > > > Key: ARROW-16503 > URL: https://issues.apache.org/jira/browse/ARROW-16503 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Dewey Dunnington >Assignee: Joris Van den Bossche >Priority: Major > Labels: good-second-issue > > It looks like Arrays with an extension type can't be concatenated. From the R > bindings: > {code:R} > library(arrow, warn.conflicts = FALSE) > arr <- vctrs_extension_array(1:10) > concat_arrays(arr, arr) > #> Error: NotImplemented: concatenation of integer(0) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195 > VisitTypeInline(*out_->type, this) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590 > ConcatenateImpl(data, pool).Concatenate(_data) > {code} > This shows up more practically when using the query engine: > {code:R} > library(arrow, warn.conflicts = FALSE) > table <- arrow_table( > group = rep(c("a", "b"), 5), > col1 = 1:10, > col2 = vctrs_extension_array(1:10) > ) > tf <- tempfile() > table |> dplyr::group_by(group) |> write_dataset(tf) > open_dataset(tf) |> > dplyr::arrange(col1) |> > dplyr::collect() > #> Error in `dplyr::collect()`: > #> ! 
NotImplemented: concatenation of extension > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195 > VisitTypeInline(*out_->type, this) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590 > ConcatenateImpl(data, pool).Concatenate(_data) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025 > Concatenate(values.chunks(), ctx->memory_pool()) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084 > TakeCA(*table.column(j), indices, options, ctx) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:527 > impl_->DoFinish() > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:467 > iterator_.Next() > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337 > ReadNext() > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351 > ToRecordBatches() > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16503) [C++] Can't concatenate extension arrays
[ https://issues.apache.org/jira/browse/ARROW-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-16503: -- Fix Version/s: 11.0.0 > [C++] Can't concatenate extension arrays > > > Key: ARROW-16503 > URL: https://issues.apache.org/jira/browse/ARROW-16503 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Dewey Dunnington >Assignee: Joris Van den Bossche >Priority: Major > Labels: good-second-issue > Fix For: 11.0.0 > > > It looks like Arrays with an extension type can't be concatenated. From the R > bindings: > {code:R} > library(arrow, warn.conflicts = FALSE) > arr <- vctrs_extension_array(1:10) > concat_arrays(arr, arr) > #> Error: NotImplemented: concatenation of integer(0) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195 > VisitTypeInline(*out_->type, this) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590 > ConcatenateImpl(data, pool).Concatenate(_data) > {code} > This shows up more practically when using the query engine: > {code:R} > library(arrow, warn.conflicts = FALSE) > table <- arrow_table( > group = rep(c("a", "b"), 5), > col1 = 1:10, > col2 = vctrs_extension_array(1:10) > ) > tf <- tempfile() > table |> dplyr::group_by(group) |> write_dataset(tf) > open_dataset(tf) |> > dplyr::arrange(col1) |> > dplyr::collect() > #> Error in `dplyr::collect()`: > #> ! 
NotImplemented: concatenation of extension > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195 > VisitTypeInline(*out_->type, this) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590 > ConcatenateImpl(data, pool).Concatenate(_data) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025 > Concatenate(values.chunks(), ctx->memory_pool()) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084 > TakeCA(*table.column(j), indices, options, ctx) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:527 > impl_->DoFinish() > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:467 > iterator_.Next() > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337 > ReadNext() > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351 > ToRecordBatches() > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18102) [R] dplyr::count and dplyr::tally implementation return NA instead of 0
[ https://issues.apache.org/jira/browse/ARROW-18102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621118#comment-17621118 ] Neal Richardson commented on ARROW-18102: - We could work around this in R but it seems reasonable that this should be fixed in C++. What do you think [~westonpace]? > [R] dplyr::count and dplyr::tally implementation return NA instead of 0 > --- > > Key: ARROW-18102 > URL: https://issues.apache.org/jira/browse/ARROW-18102 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0 >Reporter: Adam Black >Priority: Major > > I'm using dplyr with FileSystemDataset objects. The expected behavior is > similar (or the same as) dataframe behavior. When the FileSystemDataset has > zero rows dplyr::count and dplyr::tally return NA instead of 0. I would > expect the result to be 0. > > {code:r} > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > library(dplyr) > #> > #> Attaching package: 'dplyr' > #> The following objects are masked from 'package:stats': > #> > #> filter, lag > #> The following objects are masked from 'package:base': > #> > #> intersect, setdiff, setequal, union > path <- tempfile(fileext = ".feather") > zero_row_dataset <- cars %>% filter(dist < 0) > # expected behavior > zero_row_dataset %>% > count() > #> n > #> 1 0 > zero_row_dataset %>% > tally() > #> n > #> 1 0 > nrow(zero_row_dataset) > #> [1] 0 > # now test behavior with a FileSystemDataset > write_feather(zero_row_dataset, path) > ds <- open_dataset(path, format = "feather") > ds > #> FileSystemDataset with 1 Feather file > #> speed: double > #> dist: double > #> > #> See $metadata for additional Schema metadata > # actual behavior > ds %>% > count() %>% > collect() # incorrect result > #> # A tibble: 1 × 1 > #> n > #> > #> 1 NA > ds %>% > tally() %>% > collect() # incorrect result > #> # A tibble: 1 × 1 > #> 
n > #> > #> 1 NA > nrow(ds) # works as expected > #> [1] 0 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
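The core question in the ARROW-18102 report above is how a count aggregate should behave over zero input rows. A minimal pure-Python model (an illustration of the expected semantics, not Acero's implementation) shows the fix conceptually: seed the accumulator with 0 so that an empty scan yields 0 rather than a missing value.

```python
def count_rows(batches):
    # Seeding with 0 means aggregating zero batches (an empty dataset)
    # returns 0, matching data.frame behavior, instead of null/NA.
    total = 0
    for batch in batches:
        total += len(batch)
    return total

print(count_rows([]))             # 0
print(count_rows([[1, 2], [3]]))  # 3
```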
[jira] [Commented] (ARROW-17954) [R] Update News for 10.0.0
[ https://issues.apache.org/jira/browse/ARROW-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621117#comment-17621117 ] Neal Richardson commented on ARROW-17954: - It's ready to merge now. > [R] Update News for 10.0.0 > -- > > Key: ARROW-17954 > URL: https://issues.apache.org/jira/browse/ARROW-17954 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Will Jones >Assignee: Will Jones >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18013) [C++][Python] Cannot concatenate extension arrays
[ https://issues.apache.org/jira/browse/ARROW-18013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621107#comment-17621107 ] Joris Van den Bossche commented on ARROW-18013: --- This should indeed certainly work (and shouldn't be difficult; it should "just" concatenate the storage arrays). It seems we already have another issue about this (using an R example): ARROW-16503. So I will close this one as a duplicate (and will also take a look at fixing it). > [C++][Python] Cannot concatenate extension arrays > - > > Key: ARROW-18013 > URL: https://issues.apache.org/jira/browse/ARROW-18013 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 9.0.0 >Reporter: Chang She >Priority: Major > > `pa.Table.take` and `pa.ChunkedArray.combine_chunks` raise exceptions for > extension arrays. > https://github.com/apache/arrow/blob/apache-arrow-9.0.0/cpp/src/arrow/array/concatenate.cc#L440 > Quick example: > ``` > In [1]: import pyarrow as pa > In [2]: class LabelType(pa.ExtensionType): >...: >...: def __init__(self): >...: super(LabelType, self).__init__(pa.string(), "label") >...: >...: def __arrow_ext_serialize__(self): >...: return b"" >...: >...: @classmethod >...: def __arrow_ext_deserialize__(cls, storage_type, serialized): >...: return LabelType() >...: > In [3]: import numpy as np > In [4]: chunk1 = pa.ExtensionArray.from_storage(LabelType(), > pa.array(np.repeat('a', 1000))) > In [5]: chunk2 = pa.ExtensionArray.from_storage(LabelType(), > pa.array(np.repeat('b', 1000))) > In [6]: pa.chunked_array([chunk1, chunk2]).combine_chunks() > --- > ArrowNotImplementedError Traceback (most recent call last) > Cell In [6], line 1 > > 1 pa.chunked_array([chunk1, chunk2]).combine_chunks() > File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:700, in > pyarrow.lib.ChunkedArray.combine_chunks() > File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:2889, in > pyarrow.lib.concat_arrays() > File 
~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in > pyarrow.lib.pyarrow_internal_check_status() > File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in > pyarrow.lib.check_status() > ArrowNotImplementedError: concatenation of extension> > ``` > Would it be possible to concatenate the storage and the "re-box" to the > ExtensionType? -- This message was sent by Atlassian Jira (v8.20.10#820010)
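The "concatenate the storage and re-box" approach suggested in the comment above can be sketched with a toy model in Python (hypothetical classes, not the pyarrow API): unwrap each extension array to its storage, concatenate the storage, and re-box the result with the shared extension type.

```python
class ToyExtensionArray:
    """Toy model of an extension array: a typed wrapper around plain storage."""
    def __init__(self, ext_type, storage):
        self.ext_type = ext_type
        self.storage = list(storage)

def concat_extension(arrays):
    ext_type = arrays[0].ext_type
    if any(a.ext_type != ext_type for a in arrays):
        raise TypeError("cannot concatenate arrays of different extension types")
    # Concatenate the underlying storage, then re-box with the extension type.
    merged = [v for a in arrays for v in a.storage]
    return ToyExtensionArray(ext_type, merged)

a = ToyExtensionArray("label", ["a"] * 3)
b = ToyExtensionArray("label", ["b"] * 2)
print(concat_extension([a, b]).storage)  # ['a', 'a', 'a', 'b', 'b']
```

The real fix additionally has to verify that all chunks carry the *same* extension type before re-boxing, which is why the sketch checks `ext_type` up front.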
[jira] [Updated] (ARROW-18013) [C++][Python] Cannot concatenate extension arrays
[ https://issues.apache.org/jira/browse/ARROW-18013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-18013: -- Labels: extension-type (was: ) > [C++][Python] Cannot concatenate extension arrays > - > > Key: ARROW-18013 > URL: https://issues.apache.org/jira/browse/ARROW-18013 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 9.0.0 >Reporter: Chang She >Priority: Major > Labels: extension-type > > `pa.Table.take` and `pa.ChunkedArray.combine_chunks` raises exception for > extension arrays. > https://github.com/apache/arrow/blob/apache-arrow-9.0.0/cpp/src/arrow/array/concatenate.cc#L440 > Quick example: > ``` > In [1]: import pyarrow as pa > In [2]: class LabelType(pa.ExtensionType): >...: >...: def __init__(self): >...: super(LabelType, self).__init__(pa.string(), "label") >...: >...: def __arrow_ext_serialize__(self): >...: return b"" >...: >...: @classmethod >...: def __arrow_ext_deserialize__(cls, storage_type, serialized): >...: return LabelType() >...: > In [3]: import numpy as np > In [4]: chunk1 = pa.ExtensionArray.from_storage(LabelType(), > pa.array(np.repeat('a', 1000))) > In [5]: chunk2 = pa.ExtensionArray.from_storage(LabelType(), > pa.array(np.repeat('b', 1000))) > In [6]: pa.chunked_array([chunk1, chunk2]).combine_chunks() > --- > ArrowNotImplementedError Traceback (most recent call last) > Cell In [6], line 1 > > 1 pa.chunked_array([chunk1, chunk2]).combine_chunks() > File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:700, in > pyarrow.lib.ChunkedArray.combine_chunks() > File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:2889, in > pyarrow.lib.concat_arrays() > File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in > pyarrow.lib.pyarrow_internal_check_status() > File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in > pyarrow.lib.check_status() > ArrowNotImplementedError: concatenation 
of extension> > ``` > Would it be possible to concatenate the storage and the "re-box" to the > ExtensionType? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-18013) [C++][Python] Cannot concatenate extension arrays
[ https://issues.apache.org/jira/browse/ARROW-18013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-18013. - Resolution: Duplicate > [C++][Python] Cannot concatenate extension arrays > - > > Key: ARROW-18013 > URL: https://issues.apache.org/jira/browse/ARROW-18013 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 9.0.0 >Reporter: Chang She >Priority: Major > Labels: extension-type > > `pa.Table.take` and `pa.ChunkedArray.combine_chunks` raises exception for > extension arrays. > https://github.com/apache/arrow/blob/apache-arrow-9.0.0/cpp/src/arrow/array/concatenate.cc#L440 > Quick example: > ``` > In [1]: import pyarrow as pa > In [2]: class LabelType(pa.ExtensionType): >...: >...: def __init__(self): >...: super(LabelType, self).__init__(pa.string(), "label") >...: >...: def __arrow_ext_serialize__(self): >...: return b"" >...: >...: @classmethod >...: def __arrow_ext_deserialize__(cls, storage_type, serialized): >...: return LabelType() >...: > In [3]: import numpy as np > In [4]: chunk1 = pa.ExtensionArray.from_storage(LabelType(), > pa.array(np.repeat('a', 1000))) > In [5]: chunk2 = pa.ExtensionArray.from_storage(LabelType(), > pa.array(np.repeat('b', 1000))) > In [6]: pa.chunked_array([chunk1, chunk2]).combine_chunks() > --- > ArrowNotImplementedError Traceback (most recent call last) > Cell In [6], line 1 > > 1 pa.chunked_array([chunk1, chunk2]).combine_chunks() > File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:700, in > pyarrow.lib.ChunkedArray.combine_chunks() > File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:2889, in > pyarrow.lib.concat_arrays() > File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in > pyarrow.lib.pyarrow_internal_check_status() > File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in > pyarrow.lib.check_status() > ArrowNotImplementedError: concatenation of 
extension> > ``` > Would it be possible to concatenate the storage and the "re-box" to the > ExtensionType? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17317) [Release][Docs] Normalize previous document version directory
[ https://issues.apache.org/jira/browse/ARROW-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-17317. --- Resolution: Fixed Issue resolved by pull request 14457 [https://github.com/apache/arrow/pull/14457] > [Release][Docs] Normalize previous document version directory > - > > Key: ARROW-17317 > URL: https://issues.apache.org/jira/browse/ARROW-17317 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > We should use X.Y instead of X.Y.Z (e.g.: 8.0 not 8.0.1) for previous version > document directory. > See also: > https://github.com/apache/arrow/blob/apache-arrow-9.0.0/dev/release/post-08-docs.sh#L84 > The script should accept X.Y.Z such as 8.0.1 and normalize it to X.Y. It'll > reduce human error. > See also: > * https://github.com/apache/arrow-site/pull/228#issuecomment-1205997067 > * https://github.com/apache/arrow-site/pull/228#issuecomment-1206085602 -- This message was sent by Atlassian Jira (v8.20.10#820010)
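The normalization ARROW-17317 describes (accept X.Y.Z, emit X.Y) is straightforward to express. A sketch of the idea in Python (the actual release script is shell, and this function name is hypothetical):

```python
import re

def normalize_docs_version(version: str) -> str:
    """Turn an X.Y.Z (or X.Y) version string into the X.Y directory name."""
    m = re.fullmatch(r"(\d+)\.(\d+)(?:\.\d+)?", version)
    if m is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return f"{m.group(1)}.{m.group(2)}"

print(normalize_docs_version("8.0.1"))  # 8.0
print(normalize_docs_version("8.0"))    # 8.0
```

Accepting both forms and normalizing in the script is what removes the chance of a human typing `8.0.1` and creating a stray `8.0.1/` docs directory.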
[jira] [Updated] (ARROW-17317) [Release][Docs] Normalize previous document version directory
[ https://issues.apache.org/jira/browse/ARROW-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-17317: -- Fix Version/s: 10.0.0 (was: 11.0.0) > [Release][Docs] Normalize previous document version directory > - > > Key: ARROW-17317 > URL: https://issues.apache.org/jira/browse/ARROW-17317 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > We should use X.Y instead of X.Y.Z (e.g.: 8.0 not 8.0.1) for previous version > document directory. > See also: > https://github.com/apache/arrow/blob/apache-arrow-9.0.0/dev/release/post-08-docs.sh#L84 > The script should accept X.Y.Z such as 8.0.1 and normalize it to X.Y. It'll > reduce human error. > See also: > * https://github.com/apache/arrow-site/pull/228#issuecomment-1205997067 > * https://github.com/apache/arrow-site/pull/228#issuecomment-1206085602 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17966) [C++] Adjust to new format for Substrait optional arguments
[ https://issues.apache.org/jira/browse/ARROW-17966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-17966: Assignee: Ben Kietzman (was: Weston Pace) > [C++] Adjust to new format for Substrait optional arguments > --- > > Key: ARROW-17966 > URL: https://issues.apache.org/jira/browse/ARROW-17966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Substrait is presumably going to change how it defines optional arguments in > https://github.com/substrait-io/substrait/pull/342 . > This change will require a corresponding change in Acero (this should also > bring Acero in line with Ibis & Isthmus). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17966) [C++] Adjust to new format for Substrait optional arguments
[ https://issues.apache.org/jira/browse/ARROW-17966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-17966: Assignee: Weston Pace (was: Ben Kietzman) > [C++] Adjust to new format for Substrait optional arguments > --- > > Key: ARROW-17966 > URL: https://issues.apache.org/jira/browse/ARROW-17966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Substrait is presumably going to change how it defines optional arguments in > https://github.com/substrait-io/substrait/pull/342 . > This change will require a corresponding change in Acero (this should also > bring Acero in line with Ibis & Isthmus). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17966) [C++] Adjust to new format for Substrait optional arguments
[ https://issues.apache.org/jira/browse/ARROW-17966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-17966: Assignee: Weston Pace > [C++] Adjust to new format for Substrait optional arguments > --- > > Key: ARROW-17966 > URL: https://issues.apache.org/jira/browse/ARROW-17966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Substrait is presumably going to change how it defines optional arguments in > https://github.com/substrait-io/substrait/pull/342 . > This change will require a corresponding change in Acero (this should also > bring Acero in line with Ibis & Isthmus). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18107) [C++] Provide more informative error when (CSV/JSON) parsing fails
Joris Van den Bossche created ARROW-18107: - Summary: [C++] Provide more informative error when (CSV/JSON) parsing fails Key: ARROW-18107 URL: https://issues.apache.org/jira/browse/ARROW-18107 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Related to ARROW-18106 (and derived from https://stackoverflow.com/questions/74138746/why-i-cant-parse-timestamp-in-pyarrow). Assume you have the following code to read a JSON file with timestamps. The timestamps have a sub-second part in their string, which fails parsing if you specify it as second resolution timestamp: {code:python} import io from pyarrow import json s_json = """{"column":"2022-09-05T08:08:46.000"}""" opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]), unexpected_field_behavior="ignore") json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) {code} gives: {code} ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't parse:2022-09-05T08:08:46.000 {code} This error is expected, but I think it could be more informative about the reason why it failed parsing (because at first sight it looks like a proper timestamp string, so you might be left wondering why this is failing). (this might not be that straightforward, though, since there can be many reasons why the parsing is failing) -- This message was sent by Atlassian Jira (v8.20.10#820010)
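One way to make the failure ARROW-18107 describes more informative, sketched in Python rather than in the C++ converter, is to detect the common case of extra fractional-second precision and mention it in the message. This is a hypothetical illustration (function name and wording invented here); as the issue notes, a real converter would have to cover many more failure modes.

```python
def explain_timestamp_failure(value: str, unit: str) -> str:
    """Build a parse-failure message that names a likely cause when detectable."""
    msg = f"Failed to convert JSON string {value!r} to timestamp[{unit}]"
    # Easily detected cause: the string carries a fractional-second part
    # but the target unit is whole seconds.
    if unit == "s" and "." in value:
        msg += (" (input has a fractional-second component; use a finer unit"
                " such as 'ms' or 'us', or truncate the input)")
    return msg

print(explain_timestamp_failure("2022-09-05T08:08:46.000", "s"))
```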
[jira] [Updated] (ARROW-17192) [Python] .to_pandas can't read_feather if a date column contains dates before 1677 and after 2262
[ https://issues.apache.org/jira/browse/ARROW-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alenka Frim updated ARROW-17192: Description: A feather file with a column containing dates lower than 1677 or greater than 2262 cannot be read with pandas, due to the `.to_pandas` method. To reproduce the issue: {code:java} ### create feather file import pandas as pd from datetime import datetime df = pd.DataFrame({"date": [ datetime.fromisoformat("1654-01-01"), datetime.fromisoformat("1920-01-01"), ],}) df.to_feather("to_trash.feather") ### read feather file from pyarrow.feather import read_feather read_feather("to_trash.feather") {code} I think that the expected behavior would be to have an object column containing datetime objects. I think that the problem comes from the _array_like_to_pandas method: [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L1584] or from `_to_pandas()` [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L2742] or from `to_pandas`: [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L673] was: A feather file with a column containing dates lower than 1677 or greater than 2262 cannot be read with pandas, due to the `.to_pandas` method. To reproduce the issue: {code:java} ### create feather file import pandas as pd from datetime import datetime df = pd.DataFrame({"date": [ datetime.fromisoformat("1654-01-01"), datetime.fromisoformat("1920-01-01"), ],}) df.to_feather("to_trash.feather") ### read feather file from pyarrow.feather import read_feather read_feather("to_trash.feather") {code} I think that the expected behavior would be to have an object column containing datetime objects. 
I think that the problem comes from the _array_like_to_pandas method: [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L1584] or from `_to_pandas()` [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L2742] or from `to_pandas`: [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L673] > [Python] .to_pandas can't read_feather if a date column contains dates > before 1677 and after 2262 > -- > > Key: ARROW-17192 > URL: https://issues.apache.org/jira/browse/ARROW-17192 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Environment: Any environment >Reporter: Adrien Pacifico >Priority: Major > > A feather file with a column containing dates lower than 1677 or greater than > 2262 cannot be read with pandas, due to the `.to_pandas` method. > To reproduce the issue: > {code:java} > ### create feather file > import pandas as pd > from datetime import datetime > df = pd.DataFrame({"date": [ > datetime.fromisoformat("1654-01-01"), > datetime.fromisoformat("1920-01-01"), > ],}) > df.to_feather("to_trash.feather") > ### read feather file > from pyarrow.feather import read_feather > read_feather("to_trash.feather") > {code} > > I think that the expected behavior would be to have an object column > containing datetime objects. > I think that the problem comes from the _array_like_to_pandas method: > [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L1584] > or from `_to_pandas()` > [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L2742] > or from `to_pandas`: > [https://github.com/apache/arrow/blob/76f45a6892b13391fdede4c72934f75f6d56143c/python/pyarrow/array.pxi#L673] -- This message was sent by Atlassian Jira (v8.20.10#820010)
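For background on the 1677/2262 cut-offs in the issue title: pandas stores timestamps as int64 nanoseconds since the Unix epoch, which gives roughly ±292 years around 1970 (pandas documents the exact bounds as 1677-09-21 to 2262-04-11). A quick sketch of the arithmetic:

```python
# Why 1677 and 2262: an int64 nanosecond counter spans about +/- 292 years
# around the 1970 epoch, so dates outside that window overflow.
span_years = (2**63 - 1) / (1e9 * 60 * 60 * 24 * 365.25)
print(round(span_years))  # 292

def fits_in_ns_range(year: int) -> bool:
    # Rough year-level check against the pandas Timestamp.min/max bounds.
    return 1677 < year < 2262

# The failing date from the report is outside the window; the other is inside.
assert not fits_in_ns_range(1654)
assert fits_in_ns_range(1920)
```

This is why converting such a column to nanosecond-resolution values fails, and why falling back to an object column of `datetime` objects is the behavior the reporter expects.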
[jira] [Created] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"
Joris Van den Bossche created ARROW-18106: - Summary: [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer" Key: ARROW-18106 URL: https://issues.apache.org/jira/browse/ARROW-18106 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Not 100% sure this is a "bug", but at least I find it an unexpected interplay between two options. By default, when reading json, we _infer_ the data type of columns, and when specifying an explicit schema, we _also_ by default infer the type of columns that are not specified in the explicit schema. The docs for {{unexpected_field_behavior}}: > How JSON fields outside of explicit_schema (if given) are treated But it seems that if you specify a schema, and the parsing of one of the columns fails according to that schema, we still fall back to this default of inferring the data type (while I would have expected an error, since we should only infer for columns _not_ in the schema). Example code using pyarrow: {code:python} import io import pyarrow as pa from pyarrow import json s_json = """{"column":"2022-09-05T08:08:46.000"}""" opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))])) json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) {code} The parsing fails here because there are milliseconds and the type is "s", but the explicit schema is ignored, and we get a result with a string column instead: {code} pyarrow.Table column: string column: [["2022-09-05T08:08:46.000"]] {code} But when adding {{unexpected_field_behavior="ignore"}}, we actually get the expected parse error: {code:python} opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]), unexpected_field_behavior="ignore") json.read_json(io.BytesIO(s_json.encode()), parse_options=opts) {code} gives {code} ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't parse:2022-09-05T08:08:46.000 {code} It might be that this is specific to
timestamps; I don't directly see a similar issue with e.g. {{"column": "A"}} and setting the schema to "column" being int64. -- This message was sent by Atlassian Jira (v8.20.10#820010)
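The reported interplay can be modeled with a toy converter in plain Python (this is a simplified sketch of the described behavior, not the Arrow code path): when conversion per the explicit schema fails and the behavior is the default "infer", the column silently falls back to the inferred string type; any non-default setting surfaces the error instead.

```python
def convert_column(values, declared_type, unexpected_field_behavior="infer"):
    """Toy model of the interplay above: conversion per the explicit schema
    fails, and the default 'infer' silently falls back to string."""
    def parse_as(typ, v):
        # Mimic the failing case: a second-resolution timestamp cannot
        # consume a fractional-second part.
        if typ == "timestamp[s]" and "." in v:
            raise ValueError(f"couldn't parse {v} as {typ}")
        return v

    try:
        return declared_type, [parse_as(declared_type, v) for v in values]
    except ValueError:
        if unexpected_field_behavior == "infer":
            return "string", values  # silent fallback: schema ignored
        raise                        # expected behavior: surface the error

typ, _ = convert_column(["2022-09-05T08:08:46.000"], "timestamp[s]")
print(typ)  # string
```

With `unexpected_field_behavior="ignore"` the same call raises, matching the second example in the report.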
[jira] [Closed] (ARROW-17308) ValueError: Keyword 'validate_schema' is not yet supported with the new Dataset API
[ https://issues.apache.org/jira/browse/ARROW-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alenka Frim closed ARROW-17308. --- Resolution: Duplicate > ValueError: Keyword 'validate_schema' is not yet supported with the new > Dataset API > --- > > Key: ARROW-17308 > URL: https://issues.apache.org/jira/browse/ARROW-17308 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Abderrahmane Jaidi >Priority: Major > Labels: dataset-parquet-legacy, dataset-parquet-read > > Documentation for PyArrow 6.x and 7.x both indicate that the > `validate_schema` argument is supported in the `ParquetDataset` class. Yet > passing that argument to an instance results in: > ValueError: Keyword 'validate_schema' is not yet supported with the new > Dataset API > Code: > {code:python} > parquet_dataset = pyarrow.parquet.ParquetDataset( > path_or_paths=paths, > validate_schema=validate_schema, > filesystem=filesystem, > partitioning=partitioning, > use_legacy_dataset=False, > ){code} > Docs link: > [https://arrow.apache.org/docs/6.0/python/generated/pyarrow.parquet.ParquetDataset.html] > [https://arrow.apache.org/docs/7.0/python/generated/pyarrow.parquet.ParquetDataset.html] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17308) ValueError: Keyword 'validate_schema' is not yet supported with the new Dataset API
[ https://issues.apache.org/jira/browse/ARROW-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621081#comment-17621081 ] Alenka Frim commented on ARROW-17308: - Maybe it is worth mentioning that even if the {{validate_schema}} argument is not supplied, the Arrow C++ implementation validates the schema for all the fragments of the dataset and returns a validation error in case of mismatch. > ValueError: Keyword 'validate_schema' is not yet supported with the new > Dataset API > --- > > Key: ARROW-17308 > URL: https://issues.apache.org/jira/browse/ARROW-17308 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Abderrahmane Jaidi >Priority: Major > Labels: dataset-parquet-legacy, dataset-parquet-read > > Documentation for PyArrow 6.x and 7.x both indicate that the > `validate_schema` argument is supported in the `ParquetDataset` class. Yet > passing that argument to an instance results in: > ValueError: Keyword 'validate_schema' is not yet supported with the new > Dataset API > Code: > {code:python} > parquet_dataset = pyarrow.parquet.ParquetDataset( > path_or_paths=paths, > validate_schema=validate_schema, > filesystem=filesystem, > partitioning=partitioning, > use_legacy_dataset=False, > ){code} > Docs link: > [https://arrow.apache.org/docs/6.0/python/generated/pyarrow.parquet.ParquetDataset.html] > [https://arrow.apache.org/docs/7.0/python/generated/pyarrow.parquet.ParquetDataset.html] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
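The cross-fragment validation the comment describes can be sketched minimally in plain Python (schemas modeled as name-to-type dicts; the function is hypothetical, not the Arrow C++ API):

```python
def validate_fragment_schemas(fragment_schemas):
    """Check that every fragment of a dataset shares one schema, raising on
    the first mismatch -- a sketch of what the dataset layer does even
    without an explicit validate_schema flag."""
    reference = fragment_schemas[0]
    for i, schema in enumerate(fragment_schemas[1:], start=1):
        if schema != reference:
            raise ValueError(
                f"schema of fragment {i} {schema} does not match {reference}")
    return reference

print(validate_fragment_schemas([{"a": "int64"}, {"a": "int64"}]))
```

A dataset whose fragments disagree (say one file stores column "a" as a string) would fail this check at read time, which is the behavior the comment points out.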
[jira] [Closed] (ARROW-17200) [Python][Parquet] support partitioning by Pandas DataFrame index
[ https://issues.apache.org/jira/browse/ARROW-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alenka Frim closed ARROW-17200. --- Resolution: Invalid > [Python][Parquet] support partitioning by Pandas DataFrame index > > > Key: ARROW-17200 > URL: https://issues.apache.org/jira/browse/ARROW-17200 > Project: Apache Arrow > Issue Type: New Feature > Components: Parquet, Python >Reporter: Gregory Werbin >Priority: Minor > > In a Pandas {{DataFrame}} with a multi-index, with a slowly-varying "outer" > index level, one might want to partition by that index level when saving the > data frame to Parquet format. This is currently not possible; you need to > manually reset the index before writing, and re-add the index after reading. > It would be very useful if you could supply the name of an index level to > {{partition_cols}} instead of (or ideally in addition to) a data column name. > I originally posted this on the Pandas issue tracker > ([https://github.com/pandas-dev/pandas/issues/47797]). Matthew Roeschke > looked at the code and figured out that the partitioning functionality was > implemented entirely in PyArrow, and that the change would need to happen > within PyArrow itself. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17200) [Python][Parquet] support partitioning by Pandas DataFrame index
[ https://issues.apache.org/jira/browse/ARROW-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621068#comment-17621068 ] Alenka Frim commented on ARROW-17200: - This should be possible. When transforming a pandas dataframe into an Arrow table, the multi-index is converted into columns. These columns can then be defined as {{partition_cols}} for writing parquet files into partitions. Also, looking at the pandas codebase, the correct method, {{write_to_dataset}}, is selected if {{partition_cols}} are supplied: [https://github.com/pandas-dev/pandas/blob/56d82a9bd654e91d14596e82e4d9c82215fa5bc8/pandas/io/parquet.py#L195-L209] A working example: {code:python} import pandas as pd import numpy as np # Creating a dataframe with MultiIndex arrays = [ ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"], ["one", "two", "one", "two", "one", "two", "one", "two"], ] tuples = list(zip(*arrays)) index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"]) df = pd.DataFrame(data={'randn': np.random.randn(8)}, index=index) # writing to a partitioned dataset df.to_parquet(path='dataset_name', partition_cols=["first", "second"]) # inspecting the pieces import pyarrow.parquet as pq dataset = pq.ParquetDataset('dataset_name', use_legacy_dataset=False) dataset.fragments # [<pyarrow.dataset.ParquetFileFragment>, # <pyarrow.dataset.ParquetFileFragment>, # ... (one fragment per partition file)] {code} > [Python][Parquet] support partitioning by Pandas DataFrame index > > > Key: ARROW-17200 > URL: https://issues.apache.org/jira/browse/ARROW-17200 > Project: Apache Arrow > Issue Type: New Feature > Components: Parquet, Python >Reporter: Gregory Werbin >Priority: Minor > > In a Pandas {{DataFrame}} with a multi-index, with a slowly-varying "outer" > index level, one might want to partition by that index level when saving the > data frame to Parquet format. This is currently not possible; you need to > manually reset the index before writing, and re-add the index after reading. 
> It would be very useful if you could supply the name of an index level to > {{partition_cols}} instead of (or ideally in addition to) a data column name. > I originally posted this on the Pandas issue tracker > ([https://github.com/pandas-dev/pandas/issues/47797]). Matthew Roeschke > looked at the code and figured out that the partitioning functionality was > implemented entirely in PyArrow, and that the change would need to happen > within PyArrow itself. -- This message was sent by Atlassian Jira (v8.20.10#820010)
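For readers unfamiliar with the on-disk layout that {{partition_cols}} produces: each partition value pair becomes one hive-style directory level. A minimal sketch of the path construction, reusing the column names from the example above (this models the naming convention only, not the pyarrow writer):

```python
def hive_partition_path(row: dict, partition_cols: list) -> str:
    """Build the hive-style directory prefix ('col=value/...') that a row
    would be written under when partitioned by `partition_cols`."""
    return "/".join(f"{col}={row[col]}" for col in partition_cols)

row = {"first": "bar", "second": "one", "randn": 0.5}
print(hive_partition_path(row, ["first", "second"]))  # first=bar/second=one
```

This is why converting the multi-index levels to ordinary columns is enough: once "first" and "second" are columns, each row maps cleanly to a partition directory.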
[jira] [Updated] (ARROW-18064) [Python] Error of wrong number of rows read from Parquet file
[ https://issues.apache.org/jira/browse/ARROW-18064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-18064: -- Summary: [Python] Error of wrong number of rows read from Parquet file (was: [Python] Error of wrong number of rows read from file) > [Python] Error of wrong number of rows read from Parquet file > - > > Key: ARROW-18064 > URL: https://issues.apache.org/jira/browse/ARROW-18064 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 7.0.0, 7.0.1, 8.0.0, 8.0.1, 9.0.0 > Environment: Python Info > 3.10.7 (tags/v3.10.7:6cc6b13, Sep 5 2022, 14:08:36) [MSC v.1933 64 bit > (AMD64)] > Pyarrow Info > 6.0.1 > Platform Info > Windows-10-10.0.19042-SP0 > Windows > 10 > 10.0.19042 > 19042 > AMD64 >Reporter: Blake erickson >Priority: Major > Attachments: badplug.parquet, readBadParquet.py, screenshot-1.png > > > on version greater than 6.0.1 fail to read tables saying expected length n, > got n=1 rows > > Tables can be read column by column fine, or with a fixed number of rows > matching the meta data fine. Reads correctly in version 6.0.1 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17360) [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns
[ https://issues.apache.org/jira/browse/ARROW-17360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621059#comment-17621059 ] Joris Van den Bossche commented on ARROW-17360: --- For reference, we had a similar issue for Feather, where the underlying C++ reader always follows the order of the schema (ARROW-8641). And there we solved this by reordering the columns on the Python side in {{pyarrow.feather.read_table}} (as Alenka linked above). > [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns > --- > > Key: ARROW-17360 > URL: https://issues.apache.org/jira/browse/ARROW-17360 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 8.0.1 >Reporter: Matthew Roeschke >Priority: Major > Labels: orc > > xref [https://github.com/pandas-dev/pandas/issues/47944] > > {code:java} > In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]}) > # pandas main branch / 1.5 > In [2]: df.to_orc("abc") > In [3]: pd.read_orc("abc", columns=['b', 'a']) > Out[3]: >a b > 0 1 a > 1 2 b > 2 3 c > In [4]: import pyarrow.orc as orc > In [5]: orc_file = orc.ORCFile("abc") > # reordered to a, b > In [6]: orc_file.read(columns=['b', 'a']).to_pandas() > Out[6]: >a b > 0 1 a > 1 2 b > 2 3 c > # reordered to a, b > In [7]: orc_file.read(columns=['b', 'a']) > Out[7]: > pyarrow.Table > a: int64 > b: string > > a: [[1,2,3]] > b: [["a","b","c"]] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
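The Python-side fix used for Feather amounts to reordering the columns after the read. With a table modeled as an ordered dict, the idea is (a sketch of the approach, not pyarrow's actual code):

```python
def reorder_columns(table: dict, columns: list) -> dict:
    """The reader returns columns in schema order; reorder them afterwards
    to match the caller's `columns` list, as the Feather fix (ARROW-8641)
    does on the Python side. Table modeled as a name -> values dict."""
    return {name: table[name] for name in columns}

table = {"a": [1, 2, 3], "b": ["a", "b", "c"]}  # reader output, schema order
reordered = reorder_columns(table, ["b", "a"])
print(list(reordered))  # ['b', 'a']
```

The same post-read reordering could be applied in `pyarrow.orc` until (or instead of) an upstream Apache ORC change.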
[jira] [Updated] (ARROW-17540) [Python] Can not refer to field in a list of structs
[ https://issues.apache.org/jira/browse/ARROW-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-17540: -- Fix Version/s: 11.0.0 > [Python] Can not refer to field in a list of structs > - > > Key: ARROW-17540 > URL: https://issues.apache.org/jira/browse/ARROW-17540 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 >Reporter: Lei (Eddy) Xu >Priority: Major > Fix For: 11.0.0 > > > When the dataset has nested structs, "list<struct>", we can not use > `pyarrow.field(..)` to get the reference of the sub-field of the struct. > > For example > > {code:python} > import pyarrow as pa > import pyarrow.dataset as ds > import pandas as pd > schema = pa.schema( > [ > pa.field( > "objects", > pa.list_( > pa.struct( > [ > pa.field("name", pa.utf8()), > pa.field("attr1", pa.float32()), > pa.field("attr2", pa.int32()), > ] > ) > ), > ) > ] > ) > table = pa.Table.from_pandas( > pd.DataFrame([{"objects": [{"name": "a", "attr1": 5.0, "attr2": 20}]}]) > ) > print(table) > dataset = ds.dataset(table) > print(dataset) > dataset.scanner(columns=["objects.attr2"]).to_table() > {code} > which throws exception: > {noformat} > Traceback (most recent call last): > File "foo.py", line 31, in > dataset.scanner(columns=["objects.attr2"]).to_table() > File "pyarrow/_dataset.pyx", line 298, in pyarrow._dataset.Dataset.scanner > File "pyarrow/_dataset.pyx", line 2356, in > pyarrow._dataset.Scanner.from_dataset > File "pyarrow/_dataset.pyx", line 2202, in > pyarrow._dataset._populate_builder > File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(objects.attr2) in > objects: list<item: struct<name: string, attr1: float, attr2: int32>> > __fragment_index: int32 > __batch_index: int32 > __last_in_fragment: bool > __filename: string > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18105) Arrow Flight SegFault
[ https://issues.apache.org/jira/browse/ARROW-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621048#comment-17621048 ] David Li commented on ARROW-18105: -- Duplicate of ARROW-17822? > Arrow Flight SegFault > - > > Key: ARROW-18105 > URL: https://issues.apache.org/jira/browse/ARROW-18105 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC >Affects Versions: 9.0.0 >Reporter: Ziheng Wang >Priority: Minor > > Typo in grpc endpoint results in segfault. Probably should result in warning > instead. > ziheng@ziheng:~$ python3 > Python 3.8.0 (default, Nov 6 2019, 21:49:08) > [GCC 7.3.0] :: Anaconda, Inc. on linux > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow.flight > >>> flight_client = pyarrow.flight.connect("grcp://0.0.0.0:5005") -- This message was sent by Atlassian Jira (v8.20.10#820010)
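One way to fail fast on such a misspelled location ("grcp" instead of "grpc") is a client-side scheme check before any connection is attempted. A sketch of that guard (the accepted scheme set here is an assumption for illustration, not taken from the Flight sources):

```python
from urllib.parse import urlparse

# Assumed list of acceptable Flight URI schemes, for illustration only.
KNOWN_SCHEMES = {"grpc", "grpc+tcp", "grpc+tls", "grpc+unix"}

def check_flight_location(uri: str) -> str:
    """Reject an unrecognized scheme with a clear error instead of letting
    a typo like 'grcp://' reach the transport layer."""
    scheme = urlparse(uri).scheme
    if scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unrecognized Flight scheme {scheme!r} in {uri!r}; "
                         f"expected one of {sorted(KNOWN_SCHEMES)}")
    return uri

check_flight_location("grpc://0.0.0.0:5005")    # passes
# check_flight_location("grcp://0.0.0.0:5005")  # raises ValueError
```

Turning the segfault into a `ValueError` of this shape is essentially the warning/error behavior the reporter suggests.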
[jira] [Updated] (ARROW-17611) [Rust] Boolean column data saved with V2 from arrow-rs unreadable by pyarrow
[ https://issues.apache.org/jira/browse/ARROW-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miles Granger updated ARROW-17611: -- Summary: [Rust] Boolean column data saved with V2 from arrow-rs unreadable by pyarrow (was: Boolean column data saved with V2 from arrow-rs unreadable by pyarrow) > [Rust] Boolean column data saved with V2 from arrow-rs unreadable by pyarrow > > > Key: ARROW-17611 > URL: https://issues.apache.org/jira/browse/ARROW-17611 > Project: Apache Arrow > Issue Type: Bug > Components: Format >Affects Versions: 9.0.0 > Environment: Rust: > "arrow" = "21.0.0" > "parquet" = "21.0.0" > Python: > parquet-tools 0.2.11 > pyarrow 9.0.0 >Reporter: Kamil Skalski >Priority: Minor > Attachments: arrow_boolean.tar.gz, main.rs, x.parquet > > > I'm generating Parquet V2 files with boolean column, but when trying to read > them with pyarrow ({_}parquet-tool{_}s or {_}parq{_}) I'm getting > {code:java} > OSError: Unknown encoding type. {code} > To reproduce run following Rust program: > {code:java} > use arrow::json; > use std::fs::File; > const DATA: &'static str = r#" > {"x": 1, "y": false} > "#; > fn main() -> anyhow::Result<()> { > let mut json = json::ReaderBuilder::new().infer_schema(Some(2)) > .build(std::io::Cursor::new(DATA.as_bytes()))?; >let batch = json.next()?.unwrap(); >let out_file = File::create("x.parquet")?; > let props = parquet::file::properties::WriterProperties::builder() > .set_writer_version( > parquet::file::properties::WriterVersion::PARQUET_2_0) > .build(); > let mut writer = parquet::arrow::ArrowWriter::try_new( > out_file, batch.schema(), Some(props))?; > writer.write()?; > writer.close()?; > Ok(()) > } {code} > and try to show the output _x.parquet_ file > {code:java} > $ cargo run > $ parquet-tools show x.parquet > Traceback (most recent call last): > File "/home/nazgul/.local/bin/parquet-tools", line 8, in > sys.exit(main()) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/cli.py", 
line > 26, in main > args.handler(args) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/show.py", > line 59, in _cli > with get_datafame_from_objs(pfs, args.head) as df: > File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__ > return next(self.gen) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py", > line 190, in get_datafame_from_objs > df: Optional[pd.DataFrame] = stack.enter_context(pf.get_dataframe()) > File "/usr/lib/python3.10/contextlib.py", line 492, in enter_context > result = _cm_type.__enter__(cm) > File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__ > return next(self.gen) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py", > line 71, in get_dataframe > yield pq.read_table(local_path).to_pandas() > File > "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", > line 2827, in read_table > return dataset.read(columns=columns, use_threads=use_threads, > File > "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", > line 2473, in read > table = self._dataset.to_table( > File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table > File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table > File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > OSError: Unknown encoding type. {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
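For context on the encoding involved: run-length encoding stores each run of repeated values once with a count, which is why it is a natural default for boolean columns. A minimal illustration with plain tuples (Parquet's actual RLE is a bit-packed hybrid defined by the format specification, not this simple scheme):

```python
def run_length_encode(bits):
    """Encode a boolean sequence as (value, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            # Extend the current run of identical values.
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            # Start a new run.
            runs.append((b, 1))
    return runs

print(run_length_encode([False, False, True, True, True]))
# [(False, 2), (True, 3)]
```

The "Unknown encoding type" error came from the reader not recognizing this encoding for V2 boolean data pages, not from the data itself being corrupt.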
[jira] [Comment Edited] (ARROW-17611) Boolean column data saved with V2 from arrow-rs unreadable by pyarrow
[ https://issues.apache.org/jira/browse/ARROW-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621039#comment-17621039 ] Miles Granger edited comment on ARROW-17611 at 10/20/22 11:42 AM: -- It seems `arrow-rs` defaults to RLE for the boolean array; but that can be changed using a different encoding thru [set_encoding|https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_encoding] on the Builder. Other good news, is pyarrow 10.x should (and does work, checking on master now) read the given file, thanks to ARROW-18031 being closed. --- EDIT: If this answers the question/issue, can we close this? was (Author: JIRAUSER293894): It seems `arrow-rs` defaults to RLE for the boolean array; but that can be changed using a different encoding thru [set_encoding|https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_encoding] on the Builder. Other good news, is pyarrow 10.x should (and does work, checking on master now) read the given file, thanks to ARROW-18031 being closed. > Boolean column data saved with V2 from arrow-rs unreadable by pyarrow > - > > Key: ARROW-17611 > URL: https://issues.apache.org/jira/browse/ARROW-17611 > Project: Apache Arrow > Issue Type: Bug > Components: Format >Affects Versions: 9.0.0 > Environment: Rust: > "arrow" = "21.0.0" > "parquet" = "21.0.0" > Python: > parquet-tools 0.2.11 > pyarrow 9.0.0 >Reporter: Kamil Skalski >Priority: Minor > Attachments: arrow_boolean.tar.gz, main.rs, x.parquet > > > I'm generating Parquet V2 files with boolean column, but when trying to read > them with pyarrow ({_}parquet-tool{_}s or {_}parq{_}) I'm getting > {code:java} > OSError: Unknown encoding type. 
{code} > To reproduce run following Rust program: > {code:java} > use arrow::json; > use std::fs::File; > const DATA: &'static str = r#" > {"x": 1, "y": false} > "#; > fn main() -> anyhow::Result<()> { > let mut json = json::ReaderBuilder::new().infer_schema(Some(2)) > .build(std::io::Cursor::new(DATA.as_bytes()))?; >let batch = json.next()?.unwrap(); >let out_file = File::create("x.parquet")?; > let props = parquet::file::properties::WriterProperties::builder() > .set_writer_version( > parquet::file::properties::WriterVersion::PARQUET_2_0) > .build(); > let mut writer = parquet::arrow::ArrowWriter::try_new( > out_file, batch.schema(), Some(props))?; > writer.write()?; > writer.close()?; > Ok(()) > } {code} > and try to show the output _x.parquet_ file > {code:java} > $ cargo run > $ parquet-tools show x.parquet > Traceback (most recent call last): > File "/home/nazgul/.local/bin/parquet-tools", line 8, in > sys.exit(main()) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/cli.py", line > 26, in main > args.handler(args) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/show.py", > line 59, in _cli > with get_datafame_from_objs(pfs, args.head) as df: > File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__ > return next(self.gen) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py", > line 190, in get_datafame_from_objs > df: Optional[pd.DataFrame] = stack.enter_context(pf.get_dataframe()) > File "/usr/lib/python3.10/contextlib.py", line 492, in enter_context > result = _cm_type.__enter__(cm) > File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__ > return next(self.gen) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py", > line 71, in get_dataframe > yield pq.read_table(local_path).to_pandas() > File > "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", > line 2827, in read_table > 
return dataset.read(columns=columns, use_threads=use_threads, > File > "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", > line 2473, in read > table = self._dataset.to_table( > File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table > File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table > File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > OSError: Unknown encoding type. {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17611) Boolean column data saved with V2 from arrow-rs unreadable by pyarrow
[ https://issues.apache.org/jira/browse/ARROW-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621039#comment-17621039 ] Miles Granger commented on ARROW-17611: --- It seems `arrow-rs` defaults to RLE for the boolean array; but that can be changed using a different encoding thru [set_encoding|https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_encoding] on the Builder. Other good news, is pyarrow 10.x should (and does work, checking on master now) read the given file, thanks to ARROW-18031 being closed. > Boolean column data saved with V2 from arrow-rs unreadable by pyarrow > - > > Key: ARROW-17611 > URL: https://issues.apache.org/jira/browse/ARROW-17611 > Project: Apache Arrow > Issue Type: Bug > Components: Format >Affects Versions: 9.0.0 > Environment: Rust: > "arrow" = "21.0.0" > "parquet" = "21.0.0" > Python: > parquet-tools 0.2.11 > pyarrow 9.0.0 >Reporter: Kamil Skalski >Priority: Minor > Attachments: arrow_boolean.tar.gz, main.rs, x.parquet > > > I'm generating Parquet V2 files with boolean column, but when trying to read > them with pyarrow ({_}parquet-tool{_}s or {_}parq{_}) I'm getting > {code:java} > OSError: Unknown encoding type. 
{code} > To reproduce run following Rust program: > {code:java} > use arrow::json; > use std::fs::File; > const DATA: &'static str = r#" > {"x": 1, "y": false} > "#; > fn main() -> anyhow::Result<()> { > let mut json = json::ReaderBuilder::new().infer_schema(Some(2)) > .build(std::io::Cursor::new(DATA.as_bytes()))?; >let batch = json.next()?.unwrap(); >let out_file = File::create("x.parquet")?; > let props = parquet::file::properties::WriterProperties::builder() > .set_writer_version( > parquet::file::properties::WriterVersion::PARQUET_2_0) > .build(); > let mut writer = parquet::arrow::ArrowWriter::try_new( > out_file, batch.schema(), Some(props))?; > writer.write()?; > writer.close()?; > Ok(()) > } {code} > and try to show the output _x.parquet_ file > {code:java} > $ cargo run > $ parquet-tools show x.parquet > Traceback (most recent call last): > File "/home/nazgul/.local/bin/parquet-tools", line 8, in > sys.exit(main()) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/cli.py", line > 26, in main > args.handler(args) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/show.py", > line 59, in _cli > with get_datafame_from_objs(pfs, args.head) as df: > File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__ > return next(self.gen) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py", > line 190, in get_datafame_from_objs > df: Optional[pd.DataFrame] = stack.enter_context(pf.get_dataframe()) > File "/usr/lib/python3.10/contextlib.py", line 492, in enter_context > result = _cm_type.__enter__(cm) > File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__ > return next(self.gen) > File > "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py", > line 71, in get_dataframe > yield pq.read_table(local_path).to_pandas() > File > "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", > line 2827, in read_table > 
return dataset.read(columns=columns, use_threads=use_threads, > File > "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", > line 2473, in read > table = self._dataset.to_table( > File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table > File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table > File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > OSError: Unknown encoding type. {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17540) [Python] Can not refer to field in a list of structs
[ https://issues.apache.org/jira/browse/ARROW-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621037#comment-17621037 ] Miles Granger commented on ARROW-17540: --- Should also mention, that if you are only after a single list element, you can do the following, albeit ugly, bit of code here. Until it's properly fixed. {code:python} dataset.to_table(columns={ 'attr2': pc.struct_field( pc.list_element(ds.field("objects"), ds.scalar(0)), [1]) } ) {code} > [Python] Can not refer to field in a list of structs > - > > Key: ARROW-17540 > URL: https://issues.apache.org/jira/browse/ARROW-17540 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 >Reporter: Lei (Eddy) Xu >Priority: Major > > When the dataset has nested sturcts, "list", we can not use > `pyarrow.field(..)` to get the reference of the sub-field of the struct. > > For example > > {code:python} > import pyarrow as pa > import pyarrow.dataset as ds > import pandas as pd > schema = pa.schema( > [ > pa.field( > "objects", > pa.list_( > pa.struct( > [ > pa.field("name", pa.utf8()), > pa.field("attr1", pa.float32()), > pa.field("attr2", pa.int32()), > ] > ) > ), > ) > ] > ) > table = pa.Table.from_pandas( > pd.DataFrame([{"objects": [{"name": "a", "attr1": 5.0, "attr2": 20}]}]) > ) > print(table) > dataset = ds.dataset(table) > print(dataset) > dataset.scanner(columns=["objects.attr2"]).to_table() > {code} > which throws exception: > {noformat} > Traceback (most recent call last): > File "foo.py", line 31, in > dataset.scanner(columns=["objects.attr2"]).to_table() > File "pyarrow/_dataset.pyx", line 298, in pyarrow._dataset.Dataset.scanner > File "pyarrow/_dataset.pyx", line 2356, in > pyarrow._dataset.Scanner.from_dataset > File "pyarrow/_dataset.pyx", line 2202, in > pyarrow._dataset._populate_builder > File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(objects.attr2) in > objects: 
list<item: struct<name: string, attr1: float, attr2: int32>> > __fragment_index: int32 > __batch_index: int32 > __last_in_fragment: bool > __filename: string > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
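What that workaround computes, restated over plain Python rows (the list-of-structs column modeled as a list of dicts), for readers decoding the compute expression:

```python
def first_element_field(rows, list_col, field):
    """Equivalent of the list_element(..., 0) + struct_field(...) trick in
    the comment above: take the first struct of the list column in each row
    and project a single field from it."""
    return [row[list_col][0][field] for row in rows]

rows = [{"objects": [{"name": "a", "attr1": 5.0, "attr2": 20}]}]
print(first_element_field(rows, "objects", "attr2"))  # [20]
```

Note that, like the workaround, this only reaches the first element of each list; projecting the field across all list elements is what the proper fix would allow.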
[jira] [Comment Edited] (ARROW-17360) [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns
[ https://issues.apache.org/jira/browse/ARROW-17360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621017#comment-17621017 ]

Alenka Frim edited comment on ARROW-17360 at 10/20/22 11:21 AM:
----------------------------------------------------------------
Thank you for reporting! I would say this is not the expected behaviour. If we look at the {{parquet}} or {{feather}} formats, the {{read}} methods preserve the ordering of the selected columns:

{code:python}
import pyarrow as pa
table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})

import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
pq.read_table('example.parquet', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# ----
# b: [["a","b","c"]]
# a: [[1,2,3]]

import pyarrow.feather as feather
feather.write_feather(table, 'example_feather')
feather.read_table('example_feather', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# ----
# b: [["a","b","c"]]
# a: [[1,2,3]]
{code}

From what I understand, looking at the code in [pyarrow/_orc.pyx|https://github.com/apache/arrow/blob/962121062e4b13c148f24a6d4fa4b1a2f1be0d88/python/pyarrow/_orc.pyx#L379-L382] and [arrow/adapters/orc/adapter.cc|https://github.com/apache/arrow/blob/183517c8baad039c0100687c8a405bd4d8b404a7/cpp/src/arrow/adapters/orc/adapter.cc#L336-L341], I think the behaviour comes from [Apache ORC|https://github.com/apache/orc/blob/7f7362bdcecfd48e5ff9f4a3255100e3ea724f6f/c%2B%2B/include/orc/Reader.hh#L158-L165], so an issue could be opened there (about following the order in the original schema). Nevertheless, there are two options we have to make this work correctly:
* add a re-ordering in {{pyarrow}}, as is done in the [feather implementation|https://github.com/apache/arrow/blob/0f91e684ddda3dfd11d376c2755bbc3071c3099d/python/pyarrow/feather.py#L280-L281];
* even better, {{pandas}} could use the new {{dataset}} API to read {{orc}} files, like so:

{code:python}
import pyarrow.dataset as ds
dataset = ds.dataset("example.orc", format="orc")
dataset.to_table(columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# ----
# b: [["a","b","c"]]
# a: [[1,2,3]]
{code}

> [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns
> ------------------------------------------------------------------------
>
> Key: ARROW-17360
> URL: https://issues.apache.org/jira/browse/ARROW-17360
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 8.0.1
> Reporter: Matthew Roeschke
> Priority: Major
> Labels: orc
>
> xref [https://github.com/pandas-dev/pandas/issues/47944]
>
> {code:python}
> In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
> # pandas main branch / 1.5
> In [2]: df.to_orc("abc")
> In [3]: pd.read_orc("abc", columns=['b', 'a'])
> Out[3]:
>    a  b
> 0  1  a
> 1  2  b
> 2  3  c
> In [4]: import pyarrow.orc as orc
> In [5]: orc_file = orc.ORCFile("abc")
> # reordered to a, b
> In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
> Out[6]:
>    a  b
> 0  1  a
> 1  2  b
> 2  3  c
> # reordered to a, b
> In [7]: orc_file.read(columns=['b', 'a'])
> Out[7]:
> pyarrow.Table
> a: int64
> b: string
> ----
> a: [[1,2,3]]
> b: [["a","b","c"]]
> {code}
[jira] [Updated] (ARROW-18064) [Python] Error of wrong number of rows read from file
[ https://issues.apache.org/jira/browse/ARROW-18064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Miles Granger updated ARROW-18064:
----------------------------------
Summary: [Python] Error of wrong number of rows read from file (was: Error of wrong number of rows read from file)

> [Python] Error of wrong number of rows read from file
> -----------------------------------------------------
>
> Key: ARROW-18064
> URL: https://issues.apache.org/jira/browse/ARROW-18064
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 7.0.0, 7.0.1, 8.0.0, 8.0.1, 9.0.0
> Environment: Python Info
> 3.10.7 (tags/v3.10.7:6cc6b13, Sep 5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)]
> Pyarrow Info
> 6.0.1
> Platform Info
> Windows-10-10.0.19042-SP0
> Windows
> 10
> 10.0.19042
> 19042
> AMD64
> Reporter: Blake erickson
> Priority: Major
> Attachments: badplug.parquet, readBadParquet.py, screenshot-1.png
>
> On versions greater than 6.0.1, tables fail to read, with an error saying expected length n, got n=1 rows.
>
> Tables can be read fine column by column, or with a fixed number of rows matching the metadata. The file reads correctly in version 6.0.1.
[jira] [Updated] (ARROW-18099) [Python] Cannot create pandas categorical from table only with nulls
[ https://issues.apache.org/jira/browse/ARROW-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alenka Frim updated ARROW-18099:
--------------------------------
Summary: [Python] Cannot create pandas categorical from table only with nulls (was: Cannot create pandas categorical from table only with nulls)

> [Python] Cannot create pandas categorical from table only with nulls
> ---------------------------------------------------------------------
>
> Key: ARROW-18099
> URL: https://issues.apache.org/jira/browse/ARROW-18099
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 9.0.0
> Environment: OSX 12.6
> M1 silicon
> Reporter: Damian Barabonkov
> Priority: Minor
>
> A pyarrow Table with only null values cannot be instantiated as a pandas DataFrame with said column as a category. However, pandas does support "empty" categoricals. Therefore, a simple patch would be to load the pa.Table as an object first and, once in pandas, convert to a categorical, which will be empty. However, that does not solve the pyarrow bug at its root.
>
> Sample reproducible example:
> {code:python}
> import pyarrow as pa
> pylist = [{'x': None, '__index_level_0__': 2}, {'x': None, '__index_level_0__': 3}]
> tbl = pa.Table.from_pylist(pylist)
>
> # Errors
> df_broken = tbl.to_pandas(categories=["x"])
>
> # Works
> df_works = tbl.to_pandas()
> df_works = df_works.astype({"x": "category"})
> {code}
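The workaround in the report relies on pandas itself accepting an all-null categorical, i.e. a categorical with zero categories. A minimal sketch of that pandas-side behaviour (pandas only, no pyarrow involved):

```python
import pandas as pd

# pandas happily builds a categorical that has no categories at all:
s = pd.Series([None, None], dtype="category")
print(len(s.cat.categories))  # 0
print(s.isna().all())         # True

# ...which is what the astype() workaround from the report produces:
df = pd.DataFrame({"x": [None, None]}).astype({"x": "category"})
print(df["x"].dtype)  # category
```

Since pandas supports the empty categorical, the failure in {{to_pandas(categories=["x"])}} is on the pyarrow conversion path, as the report says.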
[jira] [Commented] (ARROW-18102) [R] dplyr::count and dplyr::tally implementation return NA instead of 0
[ https://issues.apache.org/jira/browse/ARROW-18102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620931#comment-17620931 ]

Nicola Crane commented on ARROW-18102:
--------------------------------------
Thanks for opening this ticket [~adam.black]. I've tried this with the dev version of Arrow, and can confirm this bug still exists there too.

> [R] dplyr::count and dplyr::tally implementation return NA instead of 0
> ------------------------------------------------------------------------
>
> Key: ARROW-18102
> URL: https://issues.apache.org/jira/browse/ARROW-18102
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0
> Reporter: Adam Black
> Priority: Minor
>
> I'm using dplyr with FileSystemDataset objects. The expected behavior is similar to (or the same as) data frame behavior. When the FileSystemDataset has zero rows, dplyr::count and dplyr::tally return NA instead of 0. I would expect the result to be 0.
>
> {code:r}
> library(arrow)
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>     timestamp
> library(dplyr)
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #>     intersect, setdiff, setequal, union
> path <- tempfile(fileext = ".feather")
> zero_row_dataset <- cars %>% filter(dist < 0)
> # expected behavior
> zero_row_dataset %>% count()
> #>   n
> #> 1 0
> zero_row_dataset %>% tally()
> #>   n
> #> 1 0
> nrow(zero_row_dataset)
> #> [1] 0
> # now test behavior with a FileSystemDataset
> write_feather(zero_row_dataset, path)
> ds <- open_dataset(path, format = "feather")
> ds
> #> FileSystemDataset with 1 Feather file
> #> speed: double
> #> dist: double
> #>
> #> See $metadata for additional Schema metadata
> # actual behavior
> ds %>% count() %>% collect()  # incorrect result
> #> # A tibble: 1 × 1
> #>       n
> #> 1    NA
> ds %>% tally() %>% collect()  # incorrect result
> #> # A tibble: 1 × 1
> #>       n
> #> 1    NA
> nrow(ds)  # works as expected
> #> [1] 0
> {code}
[jira] [Updated] (ARROW-18102) [R] dplyr::count and dplyr::tally implementation return NA instead of 0
[ https://issues.apache.org/jira/browse/ARROW-18102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicola Crane updated ARROW-18102:
---------------------------------
Priority: Major (was: Minor)

> [R] dplyr::count and dplyr::tally implementation return NA instead of 0
> ------------------------------------------------------------------------
>
> Key: ARROW-18102
> URL: https://issues.apache.org/jira/browse/ARROW-18102
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0
> Reporter: Adam Black
> Priority: Major