[jira] [Resolved] (ARROW-15709) [C++] Compilation of ARROW_ENGINE fails if doing an "inline" build
[ https://issues.apache.org/jira/browse/ARROW-15709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-15709. -- Resolution: Fixed Issue resolved by pull request 12457 [https://github.com/apache/arrow/pull/12457] > [C++] Compilation of ARROW_ENGINE fails if doing an "inline" build > -- > > Key: ARROW-15709 > URL: https://issues.apache.org/jira/browse/ARROW-15709 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Jeroen van Straten >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 8.5h > Remaining Estimate: 0h > > Typically with cmake we create a dedicated build directory. > {noformat} > cd cpp > mkdir build > cd build > cmake .. -DARROW_ENGINE=ON > {noformat} > However, it is possible to do an "inline" build: > {noformat} > cd cpp > cmake . -DARROW_ENGINE=ON > {noformat} > In the latter case we end up with a compilation error because when we clone > the substrait repo we clobber our substrait source files. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15625) [C++] Convert underscores to hyphens in example executable names too
[ https://issues.apache.org/jira/browse/ARROW-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai reassigned ARROW-15625: Assignee: Yibo Cai > [C++] Convert underscores to hyphens in example executable names too > > > Key: ARROW-15625 > URL: https://issues.apache.org/jira/browse/ARROW-15625 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yibo Cai >Assignee: Yibo Cai >Priority: Major > > While test executable names use dashes, examples use underscores. We could > convert them to be consistent, like what was done in ARROW-4648. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-3476) [Java] mvn test in memory fails on a big-endian platform
[ https://issues.apache.org/jira/browse/ARROW-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved ARROW-3476. - Fix Version/s: 3.0.0 Resolution: Fixed This is already solved, since Arrow 3.0 and later support big-endian platforms in the C++ binding. > [Java] mvn test in memory fails on a big-endian platform > > > Key: ARROW-3476 > URL: https://issues.apache.org/jira/browse/ARROW-3476 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Kazuaki Ishizaki >Priority: Major > Fix For: 3.0.0 > > > Apache Arrow is becoming commonplace to exchange data among important > emerging analytics frameworks such as Pandas, Numpy, and Spark. > [IBM Z|https://en.wikipedia.org/wiki/IBM_Z] is one of the platforms to process > critical transactions such as bank or credit card transactions. Users of IBM Z want to > extract insights from these transactions using the emerging analytics systems > on IBM Z Linux. These analytics pipelines can also be fast and effective on > IBM Z Linux by using Apache Arrow in memory. > From the technical perspective, since IBM Z Linux uses a big-endian data > format, it is not possible to use Apache Arrow in this pipeline. If Apache > Arrow could support big-endian, the use case would be expanded. > When I ran the test cases of Apache Arrow on a big-endian platform (ppc64be), > {{mvn test}} in memory causes a failure due to an assertion. > In the {{TestEndianess.testLittleEndian}} test suite, the assertion occurs during > an allocation of a {{RootAllocator}} class. > {code} > $ uname -a > Linux ppc64be.novalocal 4.5.7-300.fc24.ppc64 #1 SMP Fri Jun 10 20:29:32 UTC > 2016 ppc64 ppc64 ppc64 GNU/Linux > $ arch > ppc64 > $ cd java/memory > $ mvn test > [INFO] Scanning for projects... > [INFO] > > [INFO] > > [INFO] Building Arrow Memory 0.12.0-SNAPSHOT > [INFO] > > [INFO] > ... 
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.082 > s - in org.apache.arrow.memory.TestAccountant > [INFO] Running org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 > s - in org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Running org.apache.arrow.memory.TestBaseAllocator > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.746 > s <<< FAILURE! - in org.apache.arrow.memory.TestEndianess > [ERROR] testLittleEndian(org.apache.arrow.memory.TestEndianess) Time > elapsed: 0.313 s <<< ERROR! > java.lang.ExceptionInInitializerError > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > Caused by: java.lang.IllegalStateException: Arrow only runs on LittleEndian > systems. > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > [ERROR] Tests run: 22, Failures: 0, Errors: 21, Skipped: 1, Time elapsed: > 0.055 s <<< FAILURE! - in org.apache.arrow.memory.TestBaseAllocator > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-3476) [Java] mvn test in memory fails on a big-endian platform
[ https://issues.apache.org/jira/browse/ARROW-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497953#comment-17497953 ] Kazuaki Ishizaki commented on ARROW-3476: - Sure, let me close this issue since this was already solved. > [Java] mvn test in memory fails on a big-endian platform > > > Key: ARROW-3476 > URL: https://issues.apache.org/jira/browse/ARROW-3476 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Kazuaki Ishizaki >Priority: Major > > Apache Arrow is becoming commonplace to exchange data among important > emerging analytics frameworks such as Pandas, Numpy, and Spark. > [IBM Z|https://en.wikipedia.org/wiki/IBM_Z] is one of platforms to process > critical transactions such as bank or credit card. Users of IBM Z want to > extract insights from these transactions using the emerging analytics systems > on IBM Z Linux. These analytics pipelines can be also fast and effective on > IBM Z Linux by using Apache Arrow on memory. > From the technical perspective, since IBM Z Linux uses big-endian data > format, it is not possible to use Apache Arrow in this pipeline. If Apache > Arrow could support big-endian, the use case would be expanded. > When I ran test case of Apache arrow on a big-endian platform (ppc64be), > {{mvn test}} in memory causes a failure due to an assertion. > In {{TestEndianess.testLittleEndian}} test suite, the assertion occurs during > an allocation of a {{RootAllocator}} class. > {code} > $ uname -a > Linux ppc64be.novalocal 4.5.7-300.fc24.ppc64 #1 SMP Fri Jun 10 20:29:32 UTC > 2016 ppc64 ppc64 ppc64 GNU/Linux > $ arch > ppc64 > $ cd java/memory > $ mvn test > [INFO] Scanning for projects... > [INFO] > > [INFO] > > [INFO] Building Arrow Memory 0.12.0-SNAPSHOT > [INFO] > > [INFO] > ... 
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.082 > s - in org.apache.arrow.memory.TestAccountant > [INFO] Running org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 > s - in org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Running org.apache.arrow.memory.TestBaseAllocator > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.746 > s <<< FAILURE! - in org.apache.arrow.memory.TestEndianess > [ERROR] testLittleEndian(org.apache.arrow.memory.TestEndianess) Time > elapsed: 0.313 s <<< ERROR! > java.lang.ExceptionInInitializerError > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > Caused by: java.lang.IllegalStateException: Arrow only runs on LittleEndian > systems. > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > [ERROR] Tests run: 22, Failures: 0, Errors: 21, Skipped: 1, Time elapsed: > 0.055 s <<< FAILURE! - in org.apache.arrow.memory.TestBaseAllocator > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497949#comment-17497949 ] Kazuaki Ishizaki commented on ARROW-15778: -- [~apitrou] Thank you. In another issue, I suspect the endianness information in the schema is involved. I will look into this. > [Java] Endianness field not emitted in IPC stream > - > > Key: ARROW-15778 > URL: https://issues.apache.org/jira/browse/ARROW-15778 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > It seems the Java IPC writer implementation does not emit the Endianness > information at all (making it Little by default). This complicates > interoperability with the C++ IPC reader, which does read this information > and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_starts_monday when rounding to multiple of week
[ https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc updated ARROW-15680: --- Summary: [C++] Temporal floor/ceil/round should accept week_starts_monday when rounding to multiple of week (was: [C++] Temporal floor/ceil/round should accept week_start when rounding to multiple of week) > [C++] Temporal floor/ceil/round should accept week_starts_monday when > rounding to multiple of week > --- > > Key: ARROW-15680 > URL: https://issues.apache.org/jira/browse/ARROW-15680 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > See ARROW-14821 and the [related > PR|https://github.com/apache/arrow/pull/12154]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_start when rounding to multiple of week
[ https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15680: --- Labels: kernel pull-request-available (was: kernel) > [C++] Temporal floor/ceil/round should accept week_start when rounding to > multiple of week > --- > > Key: ARROW-15680 > URL: https://issues.apache.org/jira/browse/ARROW-15680 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > See ARROW-14821 and the [related > PR|https://github.com/apache/arrow/pull/12154]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15666) [C++][Python][R] Add format inference option to StrptimeOptions
[ https://issues.apache.org/jira/browse/ARROW-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497921#comment-17497921 ] Matthew Roeschke commented on ARROW-15666: -- Speaking from experience on the pandas side, I agree with [~jorisvandenbossche] and would caution against "inference" logic. While convenient for users, the maintenance burden can be quite significant, since inference tends to have an indefinite scope, leading to more custom logic, edge cases, etc. > [C++][Python][R] Add format inference option to StrptimeOptions > --- > > Key: ARROW-15666 > URL: https://issues.apache.org/jira/browse/ARROW-15666 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Rok Mihevc >Priority: Major > > We want to have an option to infer timestamp format. > See > [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html] > and lubridate > [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html] > for examples. -- This message was sent by Atlassian Jira (v8.20.1#820001)
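The trial-parsing approach that format inference usually boils down to can be sketched with the Python standard library alone. This is not the pyarrow or pandas implementation; the candidate list and the function name are hypothetical, purely to illustrate the "indefinite scope" concern raised above:

```python
from datetime import datetime

# Hypothetical candidate list -- a real inference scheme would need far more
# entries, which is exactly where the maintenance burden comes from.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y",
]

def infer_format(sample):
    """Return the first candidate strptime format that parses the sample."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(sample, fmt)
            return fmt
        except ValueError:
            continue
    return None
```

Note that a string like "01/02/2021" would be reported as "%d/%m/%Y" even when the data meant month/day, which illustrates the kind of silent ambiguity any inference logic has to handle.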
[jira] [Created] (ARROW-15786) [Website] Tidy use of and linking to Arrow logo
Danielle Navarro created ARROW-15786: Summary: [Website] Tidy use of and linking to Arrow logo Key: ARROW-15786 URL: https://issues.apache.org/jira/browse/ARROW-15786 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Danielle Navarro Assignee: Danielle Navarro Now that ARROW-15684 is merged, it would make sense to be consistent in how the Arrow logo is used on the website, and to link to the Visual Identity page where appropriate (e.g., there is a natural place to link to it from the Powered By page). In addition to improving the cross-linking between pages, look for cases where the site can use the updated files rather than the outdated ones (but do not delete the outdated files). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15784) [C++][Python] Parallel parquet file reading disabled with single file reads
[ https://issues.apache.org/jira/browse/ARROW-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15784: --- Labels: pull-request-available (was: ) > [C++][Python] Parallel parquet file reading disabled with single file reads > --- > > Key: ARROW-15784 > URL: https://issues.apache.org/jira/browse/ARROW-15784 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 7.0.0 >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: pull-request-available > Fix For: 7.0.1 > > Time Spent: 10m > Remaining Estimate: 0h > > There is a flag {{enable_parallel_column_conversion}} which was passed down > from python to C++ when reading parquet datasets which controlled whether we > would read columns in parallel. This was allowed for single files but not > for reading multiple files. This was an old check to help prevent nested > deadlock. > Nested deadlock is no longer an issue and the flag was mostly inert once we > removed the synchronous scanner. > Unfortunately, when we removed the synchronous scanner we forgot to remove > this flag and the result was that a single-file read ended up disabling > parallelism. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads
Weston Pace created ARROW-15785: --- Summary: [Benchmarks] Add conbench benchmark for single-file parquet reads Key: ARROW-15785 URL: https://issues.apache.org/jira/browse/ARROW-15785 Project: Apache Arrow Issue Type: Improvement Components: Benchmarking Reporter: Weston Pace Assignee: Weston Pace Release 7.0.0 introduced a regression in parquet single file reads. We should add a macro-level benchmark that does single-file reads to help us detect this in the future. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15784) [C++][Python] Parallel parquet file reading disabled with single file reads
[ https://issues.apache.org/jira/browse/ARROW-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-15784: Issue Type: Bug (was: Improvement) > [C++][Python] Parallel parquet file reading disabled with single file reads > --- > > Key: ARROW-15784 > URL: https://issues.apache.org/jira/browse/ARROW-15784 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 7.0.0 >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Fix For: 7.0.1 > > > There is a flag {{enable_parallel_column_conversion}} which was passed down > from python to C++ when reading parquet datasets which controlled whether we > would read columns in parallel. This was allowed for single files but not > for reading multiple files. This was an old check to help prevent nested > deadlock. > Nested deadlock is no longer an issue and the flag was mostly inert once we > removed the synchronous scanner. > Unfortunately, when we removed the synchronous scanner we forgot to remove > this flag and the result was that a single-file read ended up disabling > parallelism. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15784) [C++][Python] Parallel parquet file reading disabled with single file reads
Weston Pace created ARROW-15784: --- Summary: [C++][Python] Parallel parquet file reading disabled with single file reads Key: ARROW-15784 URL: https://issues.apache.org/jira/browse/ARROW-15784 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 7.0.0 Reporter: Weston Pace Assignee: Weston Pace Fix For: 7.0.1 There is a flag {{enable_parallel_column_conversion}} that was passed down from Python to C++ when reading parquet datasets and that controlled whether we would read columns in parallel. This was allowed for single files but not for reading multiple files; it was an old check to help prevent nested deadlock. Nested deadlock is no longer an issue, and the flag was mostly inert once we removed the synchronous scanner. Unfortunately, when we removed the synchronous scanner we forgot to remove this flag, and the result was that a single-file read ended up disabling parallelism. -- This message was sent by Atlassian Jira (v8.20.1#820001)
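The effect described above can be illustrated with a small stand-alone sketch (hypothetical names, not the actual scanner code): a per-column conversion helper gated by a flag, so that leaving the flag off silently serializes the work.

```python
from concurrent.futures import ThreadPoolExecutor

def convert_columns(columns, convert, parallel=True):
    """Convert every column with `convert`, in parallel when allowed.

    With parallel=False each column is converted serially -- which is, in
    effect, what the leftover flag did to single-file reads.
    """
    if not parallel:
        return [convert(column) for column in columns]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(convert, columns))
```

The results are identical either way, which is why the regression only shows up as a performance difference rather than a correctness failure.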
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497856#comment-17497856 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: -- [~jorisvandenbossche] the new typing generics look interesting. Is it practical to adopt this now, given the Python versions we currently support? Is it wise to use it in the UDF integration rather than what I am suggesting in this Jira? [~apitrou] The Numba JIT approach is nice, and it looks like an advanced feature for UDFs someday; I will keep it in mind. As [~westonpace] suggested, some of our main motivations are to support the user and provide user-friendly options when we write TPCx-BB queries and similar applications. If [~jorisvandenbossche]'s suggestion to use advanced typing is feasible and solves our underlying problem, is it wise to use that instead of making this change? > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > has some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is a `pa.Array` of `pa.Int32Type`. > At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497838#comment-17497838 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: -- I want to clarify a point, in case I have not clearly explained earlier in the thread why the typing information is necessary. If I am not mistaken, the main issue here is not what the UDF does internally with the data. We just need to register it in the function registry without taking the input and output types from the user explicitly. It is a nice-to-have feature that could improve presentability and usability as Python evolves. > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > has some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is a `pa.Array` of `pa.Int32Type`. > At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15783) [Python] Converting arrow MonthDayNanoInterval to pandas fails DCHECK
Micah Kornfield created ARROW-15783: --- Summary: [Python] Converting arrow MonthDayNanoInterval to pandas fails DCHECK Key: ARROW-15783 URL: https://issues.apache.org/jira/browse/ARROW-15783 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Micah Kornfield Assignee: Micah Kornfield InitPandasStaticData is only called on the Python/pandas -> Arrow path and not the reverse path. This causes the DCHECK that makes sure the pandas type is not null to fail if the import code is never used. A workaround for users of the library is to call pa.array([1]) first, which avoids this issue. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15782) [C++] Findre2Alt.cmake did not honor RE2_ROOT
[ https://issues.apache.org/jira/browse/ARROW-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15782: --- Labels: pull-request-available (was: ) > [C++] Findre2Alt.cmake did not honor RE2_ROOT > - > > Key: ARROW-15782 > URL: https://issues.apache.org/jira/browse/ARROW-15782 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 7.0.0 >Reporter: Haowei Yu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In Findre2Alt.cmake, the CMake module looks for the re2 package config first, which is > likely to find the system re2 package over my customized re2 library. We > should check the RE2_ROOT variable first and, only if it is not set, fall back > to searching for the system re2. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15782) [C++] Findre2Alt.cmake did not honor RE2_ROOT
Haowei Yu created ARROW-15782: - Summary: [C++] Findre2Alt.cmake did not honor RE2_ROOT Key: ARROW-15782 URL: https://issues.apache.org/jira/browse/ARROW-15782 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 7.0.0 Reporter: Haowei Yu In Findre2Alt.cmake, the CMake module looks for the re2 package config first, which is likely to find the system re2 package over my customized re2 library. We should check the RE2_ROOT variable first and, only if it is not set, fall back to searching for the system re2. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15697) [R] Add logo and meta tags to pkgdown site
[ https://issues.apache.org/jira/browse/ARROW-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15697. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12439 [https://github.com/apache/arrow/pull/12439] > [R] Add logo and meta tags to pkgdown site > -- > > Key: ARROW-15697 > URL: https://issues.apache.org/jira/browse/ARROW-15697 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Danielle Navarro >Assignee: Danielle Navarro >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > The pkgdown site currently doesn't use the Arrow logo and doesn't have nice > social media preview images -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15258) [C++] Easy options to create a source node from a table
[ https://issues.apache.org/jira/browse/ARROW-15258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-15258. - Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12267 [https://github.com/apache/arrow/pull/12267] > [C++] Easy options to create a source node from a table > --- > > Key: ARROW-15258 > URL: https://issues.apache.org/jira/browse/ARROW-15258 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Weston Pace >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > Given a Table there should be a very simple way to create a source node. > Something like: > {code} > std::shared_ptr table = ... > ARROW_RETURN_NOT_OK(arrow::compute::MakeExecNode( > "table", plan, {}, arrow::compute::TableSourceOptions{table.get()})); > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497733#comment-17497733 ] Antoine Pitrou commented on ARROW-15765: > Another dimension to consider is whether a UDF would care if an array were >dictionary encoded or not? We probably want a way to express that too. If you want a UDF to have different implementations based on the parameter types, you can't do that using type annotations. What you could do is use a two-step approach like in Numba's {{generated_jit}}: https://numba.pydata.org/numba-doc/dev/user/generated-jit.html > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > have some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, arrya2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is an `pa.Array` of `pa.Int32Type`. > At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
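The two-step approach mentioned above can be sketched in plain Python. This is a hypothetical decorator loosely mimicking the shape of Numba's generated_jit, not the Numba API itself: the decorated function receives the argument's *type* and returns the implementation to run, so different input types get different implementations without relying on annotations.

```python
def generated(builder):
    """Two-step dispatch: `builder` receives the argument's concrete type
    and returns the implementation for it (cached per type)."""
    cache = {}
    def wrapper(arg):
        impl = cache.get(type(arg))
        if impl is None:
            impl = cache[type(arg)] = builder(type(arg))
        return impl(arg)
    return wrapper

@generated
def describe(arg_type):
    # Choose an implementation based on the concrete input type,
    # e.g. a "dictionary-encoded" path vs. a plain path.
    if issubclass(arg_type, dict):
        return lambda mapping: sorted(mapping)
    return lambda sequence: list(sequence)
```

This is what makes it possible to express things like "dictionary encoded or not" that a single type annotation cannot capture.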
[jira] [Assigned] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_start when rounding to multiple of week
[ https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc reassigned ARROW-15680: -- Assignee: Rok Mihevc > [C++] Temporal floor/ceil/round should accept week_start when rounding to > multiple of week > --- > > Key: ARROW-15680 > URL: https://issues.apache.org/jira/browse/ARROW-15680 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > Labels: kernel > > See ARROW-14821 and the [related > PR|https://github.com/apache/arrow/pull/12154]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497720#comment-17497720 ] Weston Pace commented on ARROW-15765: - For a concrete use case, consider a user that wants to integrate some kind of Arrow-native geojson library. They would have extension types for geojson data types and custom functions that can do things like normalize coordinates to some kind of different reference or format coordinates in a particular way. In this case the UDFs would be taking in extension arrays for custom data types, which I think would have their own typing-based considerations. Another possible example, from the TPCx-BB benchmark, is doing sentiment analysis on strings (is this user comment a positive comment or a negative comment?). If we had an Arrow-native natural language processing library we could hook in an extract_sentiment operation which takes in strings and returns ? (maybe doubles?). As far as I know the type information itself is only used for validation and casting purposes. Another dimension to consider is whether a UDF would care if an array were dictionary encoded or not? We probably want a way to express that too. > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > has some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is a `pa.Array` of `pa.Int32Type`. 
> At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497696#comment-17497696 ] Joris Van den Bossche commented on ARROW-15765: --- In the context of a full query plan, I think it is important to know the output types given the input types, so that the types in the full query can be resolved. I am wondering if we could make use of some of the newer typing features, which would allow something like {code:python} def simple_function(arrow_array: pa.Array[pa.int32()]) -> pa.Array[pa.int32()]: return call_function("add", [arrow_array, 1]) {code} I think an object that can be subscripted with [] like this is called a "generic" in typing terminology (https://docs.python.org/3.11/library/typing.html#generics), and it would make it easier to get the type of the values in the container. On the other hand, it creates a somewhat separate typing syntax ({{pa.Array}} is not actually itself a useful class; it's always subclasses you get in practice). > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > has some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is a `pa.Array` of `pa.Int32Type`. > At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
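Whatever annotation syntax is chosen, reading the declared types off a UDF at registration time can be done with the standard typing module. A minimal sketch, using a hypothetical stand-in class instead of a real pyarrow array type:

```python
import typing

# Hypothetical stand-in for a pyarrow array class; real registration code
# would see pa.Int64Array etc. instead.
class Int64Array:
    pass

def add_arrays(array1: Int64Array, array2: Int64Array) -> Int64Array:
    """UDF body elided; only the signature matters for registration."""

def signature_types(func):
    """Return ([input annotation types], return annotation type)."""
    hints = typing.get_type_hints(func)
    return_type = hints.pop("return", None)
    return list(hints.values()), return_type

in_types, out_type = signature_types(add_arrays)
```

A registry could then map these annotation classes to Arrow DataTypes, which is the part this Jira proposes to expose.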
[jira] [Closed] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sifang Li closed ARROW-15780. - Resolution: Not A Problem > missing header file parquet/parquet_version.h > - > > Key: ARROW-15780 > URL: https://issues.apache.org/jira/browse/ARROW-15780 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 > Environment: Ubuntu 20.04 >Reporter: Sifang Li >Priority: Blocker > > I am following the instructions for writing a table to a parquet file: > [https://arrow.apache.org/docs/cpp/parquet.html] > I need to include #include "parquet/arrow/writer.h" > Apparently one header file is missing from the src - I cannot find it anywhere: > In file included from ../3rd_party/arrow/cpp/src/parquet/arrow/writer.h:24, > ... > ../3rd_party/arrow/cpp/src/parquet/properties.h:31:10: fatal error: > parquet/parquet_version.h: No such file or directory > 31 | #include "parquet/parquet_version.h" -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497677#comment-17497677 ] Sifang Li commented on ARROW-15780: --- Thanks - I will close this. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497675#comment-17497675 ] David Li commented on ARROW-15780: -- The header is not generated until install time. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497674#comment-17497674 ] David Li commented on ARROW-15780: -- Try this {noformat} cmake .. -DARROW_PARQUET=ON -DCMAKE_INSTALL_PREFIX=(path to where you want Arrow to be installed) make -j8 install {noformat} Then point your compiler to the install prefix -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15781) [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL
[ https://issues.apache.org/jira/browse/ARROW-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15781: --- Labels: pull-request-available (was: ) > [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL > - > > Key: ARROW-15781 > URL: https://issues.apache.org/jira/browse/ARROW-15781 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > See https://github.com/apache/arrow/issues/12501 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15781) [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL
David Li created ARROW-15781: Summary: [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL Key: ARROW-15781 URL: https://issues.apache.org/jira/browse/ARROW-15781 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: David Li Assignee: David Li See https://github.com/apache/arrow/issues/12501 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497657#comment-17497657 ] Sifang Li commented on ARROW-15780: --- I just ran below: (from the manual config instructions) $ mkdir build-release $ cd build-release $ cmake .. $ make -j8 # if you have 8 CPU cores, otherwise adjust -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497650#comment-17497650 ] David Li commented on ARROW-15780: -- What commands exactly did you run? When I {{ninja install}} I do see {{parquet_version.h}} in the install prefix. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497647#comment-17497647 ] Sifang Li commented on ARROW-15780: --- It looks like an installation issue - I followed the manual instructions directly at: [https://github.com/apache/arrow/blob/master/docs/source/developers/cpp/building.rst] The libs build fine in the out-of-source dir, but parquet_version.h is missing; I see there is a .in file, so apparently the process did not convert it to .h. My cmake is 3.16.3 - is that why? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-15748. Resolution: Fixed Issue resolved by pull request 12507 [https://github.com/apache/arrow/pull/12507] > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Assignee: Rok Mihevc >Priority: Minor > Labels: pull-request-available > Fix For: 8.0.0, 7.0.1 > > Time Spent: 20m > Remaining Estimate: 0h > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15774) [C++] [CMake] Missing hiveserver2 ErrorCodes_types
[ https://issues.apache.org/jira/browse/ARROW-15774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497626#comment-17497626 ] Antoine Pitrou commented on ARROW-15774: I took some time looking at this. It appears that {{ARROW_HIVESERVER2}} is currently completely broken (it probably has been for a long time). Observations: * we should track the generated {{ErrorCodes.thrift}} in the repository, instead of having to run the Python script that generates it on each build * we should put the generated C++ files for thrift definitions inside a new {{src/generated/hiveserver2}} directory * we should update {{build-support/update-thrift.sh}} to recreate said C++ generated files * when I tried to do all the above, it appeared that _some_ files are not generated by the Thrift compiler even though they should be, and no error of any sort is printed out; trying to ignore those absent files doesn't work, as there are (predictably) missing symbol errors when linking the tests > [C++] [CMake] Missing hiveserver2 ErrorCodes_types > -- > > Key: ARROW-15774 > URL: https://issues.apache.org/jira/browse/ARROW-15774 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Arch Linux 5.16.10; All dependencies are system packages >Reporter: Pradeep Garigipati >Priority: Major > Attachments: cmake_config_generate.log > > > With cmake preset `ninja-release-maximal`, one of the auto-generated files > seems to be missing, and that in turn results in the following error > {code:sh} > [96/576] Building CXX object > src/arrow/dbi/hiveserver2/CMakeFiles/arrow_hiveserver2_thrift.dir/ErrorCodes_types.cpp.o > FAILED: > src/arrow/dbi/hiveserver2/CMakeFiles/arrow_hiveserver2_thrift.dir/ErrorCodes_types.cpp.o > > /usr/bin/ccache /usr/bin/c++ -DARROW_HAVE_RUNTIME_AVX2 > -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 > -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_HDFS -DARROW_JEMALLOC > -DARROW_JEMALLOC_INCLUDE_DIR=""
-DARROW_MIMALLOC -DARROW_WITH_BROTLI > -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_RE2 -DARROW_WITH_SNAPPY > -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC -DARROW_WITH_ZLIB > -DARROW_WITH_ZSTD -DURI_STATIC_BUILD > -I/home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/src > -I/home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/src > -I/home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/src/generated -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/thirdparty/flatbuffers/include > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/jemalloc_ep-prefix/src > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/mimalloc_ep/src/mimalloc_ep/include/mimalloc-1.7 > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/xsimd_ep/src/xsimd_ep-install/include > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/zstd_ep-install/include > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/thirdparty/hadoop/include > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/orc_ep-install/include > -Wno-noexcept-type -fdiagnostics-color=always -O3 -DNDEBUG -Wall > -fno-semantic-interposition -msse4.2 -O3 -DNDEBUG -fPIC -std=c++11 > -Wno-unused-variable -Wno-shadow-field -MD -MT > src/arrow/dbi/hiveserver2/CMakeFiles/arrow_hiveserver2_thrift.dir/ErrorCodes_types.cpp.o > -MF > src/arrow/dbi/hiveserver2/CMakeFiles/arrow_hiveserver2_thrift.dir/ErrorCodes_types.cpp.o.d > -o > src/arrow/dbi/hiveserver2/CMakeFiles/arrow_hiveserver2_thrift.dir/ErrorCodes_types.cpp.o > -c > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/src/arrow/dbi/hiveserver2/ErrorCodes_types.cpp > cc1plus: fatal error: > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/src/arrow/dbi/hiveserver2/ErrorCodes_types.cpp: > No such file or directory > {code} > I have attached the cmake log of configuration/generation steps. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497625#comment-17497625 ] David Li commented on ARROW-15780: -- How did you install Arrow? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15780) missing header file parquet/parquet_version.h
Sifang Li created ARROW-15780: - Summary: missing header file parquet/parquet_version.h Key: ARROW-15780 URL: https://issues.apache.org/jira/browse/ARROW-15780 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 7.0.0 Environment: Ubuntu 20.04 Reporter: Sifang Li I am following the instructions for writing a table to a parquet file: [https://arrow.apache.org/docs/cpp/parquet.html] I need to include #include "parquet/arrow/writer.h" Apparently one header file is missing from the src - I cannot find it anywhere: In file included from ../3rd_party/arrow/cpp/src/parquet/arrow/writer.h:24, ... ../3rd_party/arrow/cpp/src/parquet/properties.h:31:10: fatal error: parquet/parquet_version.h: No such file or directory 31 | #include "parquet/parquet_version.h" -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15772) [Go][Flight] Server Basic Auth Middleware/Interceptor wrongly base64 decode
[ https://issues.apache.org/jira/browse/ARROW-15772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-15772. --- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12503 [https://github.com/apache/arrow/pull/12503] > [Go][Flight] Server Basic Auth Middleware/Interceptor wrongly base64 decode > --- > > Key: ARROW-15772 > URL: https://issues.apache.org/jira/browse/ARROW-15772 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Affects Versions: 6.0.1, 7.0.0 >Reporter: Risselin Corentin >Priority: Major > Labels: easyfix, pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Currently the implementation of the Auth interceptors uses > `base64.RawStdEncoding.DecodeString` to decode the content of the handshake. > In Go, RawStdEncoding does not use padding (with '='), so trying to authenticate > from pyarrow (with `client.authenticate_basic_token(user, password)`) will > result in an error like: > {quote}{{pyarrow._flight.FlightUnauthenticatedError: gRPC returned > unauthenticated error, with message: invalid basic auth encoding: illegal > base64 data at input byte XX}} > {quote} > StdEncoding would successfully read the content where RawStdEncoding fails. -- This message was sent by Atlassian Jira (v8.20.1#820001)
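The fix boils down to the decoder and the client agreeing on base64 padding. A Python analogy of the same mismatch (Python's base64 module standing in for Go's encoding/base64; this is not the Go code from the PR):

```python
import base64
import binascii

token = b"user:password"
padded = base64.b64encode(token)  # standard alphabet, '=' padding
assert padded.endswith(b"==")

# Decoder and encoder agree on padding: round-trips fine.
assert base64.b64decode(padded) == token

# A decoder expecting the other convention fails, which is the shape of
# the Go bug (a RawStdEncoding decoder fed by a padded pyarrow client).
try:
    base64.b64decode(padded.rstrip(b"="))
    raised = False
except binascii.Error:
    raised = True
assert raised
```

The resolution mirrors the last sentence of the report: decode with the convention that tolerates what real clients actually send.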
[jira] [Closed] (ARROW-15704) [C++] Support static linking with customized jemalloc library
[ https://issues.apache.org/jira/browse/ARROW-15704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haowei Yu closed ARROW-15704. - Resolution: Won't Fix > [C++] Support static linking with customized jemalloc library > - > > Key: ARROW-15704 > URL: https://issues.apache.org/jira/browse/ARROW-15704 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Haowei Yu >Priority: Major > > [https://github.com/apache/arrow/blob/3a8e409385c8455e6c80b867c5730965a501d113/cpp/cmake_modules/Findjemalloc.cmake#L68] > > It seems that Findjemalloc.cmake thinks it has found jemalloc only if a > shared library exists. It would be nice if Findjemalloc could statically link > against libjemalloc.a -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14798) [Python] Limit the size of the repr for large Tables
[ https://issues.apache.org/jira/browse/ARROW-14798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-14798. Resolution: Fixed Issue resolved by pull request 12091 [https://github.com/apache/arrow/pull/12091] > [Python] Limit the size of the repr for large Tables > > > Key: ARROW-14798 > URL: https://issues.apache.org/jira/browse/ARROW-14798 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Joris Van den Bossche >Assignee: Will Jones >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 8.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > The new repr is nice in that it shows a preview of the data, but it can also > become very long, flooding your console output for larger tables. > We already default to 10 preview cols, but each column can still consist of > many chunks, so it might be good to also limit it to 2 chunks. > The ChunkedArray.to_string method already has a {{window}} keyword, but that > seems to control both the number of elements to show per chunk and the number > of chunks (while it would be nice to limit e.g. to 2 chunks but show up to 10 > elements for each chunk). > cc [~amol-] -- This message was sent by Atlassian Jira (v8.20.1#820001)
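The stdlib faces the same "truncate a huge repr" concern for generic containers; as a rough analogy only (this is not the pyarrow implementation), `reprlib.Repr` caps how many elements a repr shows:

```python
import reprlib

r = reprlib.Repr()
r.maxlist = 10  # mirror the "10 preview items" idea discussed in the issue

big = list(range(100_000))
preview = r.repr(big)

# The first elements are shown and the tail is elided,
# instead of flooding the console with 100,000 values.
assert preview.startswith("[0, 1, 2")
assert "..." in preview
assert len(preview) < 100
```

The pyarrow fix applies the same principle one level up: cap both the number of chunks and the number of elements per chunk in the preview.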
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497567#comment-17497567 ] Antoine Pitrou commented on ARROW-15765: Of course, another question is: do you need to know the types at all? Without some concrete use cases it's hard to tell. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497566#comment-17497566 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: -- Should we design this feature, or, as [~jorisvandenbossche] and [~westonpace] suggested, use the inverse option to get the type from the Array type without exposing this to the user? This issue currently focuses mainly on the UDF usability piece rather than on improving core functionality of the Arrow Python API. It could be useful, but beyond the scope of this use case it is not very clear to me how useful it would be to the user. What do you think? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497545#comment-17497545 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: -- [~apitrou] I see your point. There are pitfalls and limitations to this approach. This is mainly a usability piece. I also doubt whether it is worth investing time in it if the applications turn out to be niche. But it feels like a nice-to-have feature to at least support some widely used UDF function signatures. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497542#comment-17497542 ] Antoine Pitrou commented on ARROW-13168: {{date.h}}'s {{set_install()}} seems to support the text form of the IANA database, which is also what R provides. However, on Python, pytz provides the binary form of the IANA database, which {{date.h}} currently doesn't support on Windows. > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Assignee: Will Jones >Priority: Major > Labels: timestamp > > Note: currently the timezone database is not available on Windows, so > timezone-aware operations will fail. > We're using the tz.h library, which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-Windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > Windows. The database would slowly go stale. > # download it from the IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on Windows) are required. > # local user-provided folder - the user could provide a location at build time. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
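For comparison, PEP 615 answered the same "where does the tz database come from" question for Python's stdlib with a configurable search path plus an optional first-party tzdata package, which is essentially option 5 (runtime configuration) from the issue. A stdlib sketch, not Arrow code:

```python
import zoneinfo

# zoneinfo searches these directories for the compiled IANA database,
# falling back to the `tzdata` package from PyPI if none of them match.
original = zoneinfo.TZPATH
assert isinstance(original, tuple)

# Runtime reconfiguration: "the tzdata can be found at this location".
zoneinfo.reset_tzpath(to=["/usr/share/zoneinfo"])
assert zoneinfo.TZPATH == ("/usr/share/zoneinfo",)

# Restore the default search path.
zoneinfo.reset_tzpath()
assert isinstance(zoneinfo.TZPATH, tuple)
```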
[jira] [Resolved] (ARROW-15440) [Go] Implement 'unpack_bool' with Arm64 GoLang Assembly
[ https://issues.apache.org/jira/browse/ARROW-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-15440. --- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12398 [https://github.com/apache/arrow/pull/12398] > [Go] Implement 'unpack_bool' with Arm64 GoLang Assembly > --- > > Key: ARROW-15440 > URL: https://issues.apache.org/jira/browse/ARROW-15440 > Project: Apache Arrow > Issue Type: Task > Components: Go >Reporter: Yuqi Gu >Assignee: Yuqi Gu >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Implement 'unpack_bool' with Arm64 GoLang Assembly. > {code:java} > bytes_to_bools_neon > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15723) [Python] Segfault orcWriter write table
[ https://issues.apache.org/jira/browse/ARROW-15723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497532#comment-17497532 ] Joris Van den Bossche commented on ARROW-15723: --- Thanks for the report. There are potentially multiple issues here. First, writing null arrays is not actually supported (yet). When using the OrcWriter API directly, we can see this (using the table from the code snippet above): {code} In [3]: writer = orc.ORCWriter("test.orc") In [4]: writer.write(table) ... ArrowNotImplementedError: Unknown or unsupported Arrow type: null ../src/arrow/adapters/orc/util.cc:1062 GetOrcType(*arrow_child_type) ../src/arrow/adapters/orc/adapter.cc:811 GetOrcType(*(table.schema())) {code} But it seems that for some reason this error is not bubbled up when using {{write_table}} (which uses this ORCWriter in a context manager). Then, it further seems that the segfault comes from trying to write (close) an empty file. This can be reproduced with the following as well: {code} In [1]: from pyarrow import orc In [2]: writer = orc.ORCWriter("test.orc") In [3]: writer.close() Segmentation fault (core dumped) {code} > [Python] Segfault orcWriter write table > > > Key: ARROW-15723 > URL: https://issues.apache.org/jira/browse/ARROW-15723 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 7.0.0 >Reporter: patrice >Priority: Major > > pyarrow segfaults when trying to write an ORC file from a table containing a > null array. > > from pyarrow import orc > import pyarrow as pa > a = pa.array([1, None, 3, None]) > b = pa.array([None, None, None, None]) > table = pa.table(\{"int64": a, "utf8": b}) > orc.write_table(table, 'test.orc') > zsh: segmentation fault python3 -- This message was sent by Atlassian Jira (v8.20.1#820001)
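The "error not bubbled up" part of this diagnosis is consistent with the classic context-manager pitfall where `__exit__` returns a truthy value. A minimal stdlib-only sketch of that failure mode, with hypothetical names (this is not pyarrow's actual code):

```python
class LeakyWriter:
    """Hypothetical writer whose exit path hides errors."""

    def __enter__(self):
        return self

    def write(self, table):
        raise NotImplementedError("Unknown or unsupported Arrow type: null")

    def __exit__(self, exc_type, exc, tb):
        return True  # returning a truthy value suppresses the exception

def write_table(table):
    with LeakyWriter() as writer:
        writer.write(table)

# Completes silently: the caller never sees the NotImplementedError,
# and any later close-path bug goes unreported too.
write_table(object())
```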
[jira] [Updated] (ARROW-15723) [Python] Segfault orcWriter write table
[ https://issues.apache.org/jira/browse/ARROW-15723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15723: -- Fix Version/s: 8.0.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15715) [Go] ipc.Writer includes unnecessary offsets with sliced arrays
[ https://issues.apache.org/jira/browse/ARROW-15715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-15715. --- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12453 [https://github.com/apache/arrow/pull/12453] > [Go] ipc.Writer includes unnecessary offsets with sliced arrays > > > Key: ARROW-15715 > URL: https://issues.apache.org/jira/browse/ARROW-15715 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Reporter: Chris Hoff >Assignee: Chris Hoff >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > PR incoming. > > Sliced arrays will be serialized with unnecessary trailing offsets for values > that were sliced off. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497523#comment-17497523 ] Antoine Pitrou commented on ARROW-15765: Note that this approach limits the expressivity of the type annotations. For example, if you write: {code:python} def compute_func(a: pa.ListArray) -> pa.ListArray: ... {code} ... you are not able to tell what the value type of the list type is. Similarly with parametrized types such as timestamps or decimals. > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > have some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, arrya2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is an `pa.Array` of `pa.Int32Type`. > At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
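Antoine's point about lost type parameters can be reproduced with the stdlib {{typing}} machinery alone: a bare class annotation exposes no parameters to introspection, while a parameterized generic would. The classes below are hypothetical stand-ins, not actual pyarrow types:

```python
from typing import Generic, TypeVar, get_args, get_type_hints


class ListArray:
    """Hypothetical stand-in for pa.ListArray: not generic, so an
    annotation cannot say what the list's value type is."""


T = TypeVar("T")


class TypedListArray(Generic[T]):
    """Hypothetical parameterized alternative that keeps the value type."""


class Int64Type:
    """Hypothetical stand-in for an Arrow value type."""


def compute_func(a: ListArray) -> ListArray: ...
def typed_func(a: TypedListArray[Int64Type]) -> TypedListArray[Int64Type]: ...


# A bare class annotation carries no type parameters to inspect:
bare = get_type_hints(compute_func)["a"]
assert get_args(bare) == ()

# A parameterized generic retains the value type for introspection:
rich = get_type_hints(typed_func)["a"]
assert get_args(rich) == (Int64Type,)
```

The same limitation applies to parametrized types such as timestamps or decimals, where unit, precision, and scale would likewise be invisible in a bare annotation.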
[jira] [Closed] (ARROW-15729) [R] Reading large files randomly freezes
[ https://issues.apache.org/jira/browse/ARROW-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-15729. - Resolution: Duplicate > [R] Reading large files randomly freezes > > > Key: ARROW-15729 > URL: https://issues.apache.org/jira/browse/ARROW-15729 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Christian >Priority: Critical > > Hi - > I recently upgraded to Arrow 6.0.1 and am using it in R. > Whenever reading a large file (~10gb) in Windows it randomly freezes > sometimes. I can see the memory being allocated in the first 10-20 seconds, > but then nothing happens and R just doesn't respond (the R process becomes > idle too). > I'm using the option options(arrow.use_threads=FALSE). > I didn't have this issue with the previous version (0.15.1) I was using. And > the file reads fine under Linux. > I would post a reproducible example but it happens randomly. I even thought I > would just read large files in pieces by first getting all the distinct > sections of a specific column (with compute>collect) but that hangs too. > Any ideas would be appreciated. > *Edit* > Not sure if it makes sense to anyone but after a few tries it seems that the > issue only happens in Rstudio. In the R console it loads it fine. All I'm > executing is the below. > options(arrow.use_threads=FALSE) > aa <- arrow::read_arrow('.../file.arrow5') > One thing I want to point out that the underlying Rscript process under > Rstudio seems to definitely use more than one core when executing the above. > *Edit2* > Using arrow::set_cpu_count(1) seems to solve the issue. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15729) [R] Reading large files randomly freezes
[ https://issues.apache.org/jira/browse/ARROW-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15729: -- Fix Version/s: (was: 6.0.1) > [R] Reading large files randomly freezes > > > Key: ARROW-15729 > URL: https://issues.apache.org/jira/browse/ARROW-15729 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Christian >Priority: Critical > > Hi - > I recently upgraded to Arrow 6.0.1 and am using it in R. > Whenever reading a large file (~10gb) in Windows it randomly freezes > sometimes. I can see the memory being allocated in the first 10-20 seconds, > but then nothing happens and R just doesn't respond (the R process becomes > idle too). > I'm using the option options(arrow.use_threads=FALSE). > I didn't have this issue with the previous version (0.15.1) I was using. And > the file reads fine under Linux. > I would post a reproducible example but it happens randomly. I even thought I > would just read large files in pieces by first getting all the distinct > sections of a specific column (with compute>collect) but that hangs too. > Any ideas would be appreciated. > *Edit* > Not sure if it makes sense to anyone but after a few tries it seems that the > issue only happens in Rstudio. In the R console it loads it fine. All I'm > executing is the below. > options(arrow.use_threads=FALSE) > aa <- arrow::read_arrow('.../file.arrow5') > One thing I want to point out that the underlying Rscript process under > Rstudio seems to definitely use more than one core when executing the above. > *Edit2* > Using arrow::set_cpu_count(1) seems to solve the issue. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15730) [R] Memory usage in R blows up
[ https://issues.apache.org/jira/browse/ARROW-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15730: -- Fix Version/s: (was: 6.0.1) > [R] Memory usage in R blows up > -- > > Key: ARROW-15730 > URL: https://issues.apache.org/jira/browse/ARROW-15730 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 6.0.1 >Reporter: Christian >Assignee: Will Jones >Priority: Major > Attachments: image-2022-02-19-09-05-32-278.png > > > Hi, > I'm trying to load a ~10gb arrow file into R (under Windows) > _(The file is generated in the 6.0.1 arrow version under Linux)._ > For whatever reason the memory usage blows up to ~110-120gb (in a fresh and > empty R instance). > The weird thing is that when deleting the object again and running a gc() the > memory usage goes down to 90gb only. The delta of ~20-30gb is what I would > have expected the dataframe to use up in memory (and that's also approx. what > was used - in total during the load - when running the old arrow version of > 0.15.1. And it is also what R shows me when just printing the object size.) > The commands I'm running are simply: > options(arrow.use_threads=FALSE); > arrow::set_cpu_count(1); # need this - otherwise it freezes under windows > arrow::read_arrow('file.arrow5') > Is arrow reserving some resources in the background and not giving them up > again? Are there some settings I need to change for this? > Is this something that is known and fixed in a newer version? > *Note* that this doesn't happen in Linux. There all the resources are freed > up when calling the gc() function - not sure if it matters but there I also > don't need to set the cpu count to 1. > Any help would be appreciated. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Reopened] (ARROW-15729) [R] Reading large files randomly freezes
[ https://issues.apache.org/jira/browse/ARROW-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reopened ARROW-15729: --- > [R] Reading large files randomly freezes > > > Key: ARROW-15729 > URL: https://issues.apache.org/jira/browse/ARROW-15729 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Christian >Priority: Critical > Fix For: 6.0.1 > > > Hi - > I recently upgraded to Arrow 6.0.1 and am using it in R. > Whenever reading a large file (~10gb) in Windows it randomly freezes > sometimes. I can see the memory being allocated in the first 10-20 seconds, > but then nothing happens and R just doesn't respond (the R process becomes > idle too). > I'm using the option options(arrow.use_threads=FALSE). > I didn't have this issue with the previous version (0.15.1) I was using. And > the file reads fine under Linux. > I would post a reproducible example but it happens randomly. I even thought I > would just read large files in pieces by first getting all the distinct > sections of a specific column (with compute>collect) but that hangs too. > Any ideas would be appreciated. > *Edit* > Not sure if it makes sense to anyone but after a few tries it seems that the > issue only happens in Rstudio. In the R console it loads it fine. All I'm > executing is the below. > options(arrow.use_threads=FALSE) > aa <- arrow::read_arrow('.../file.arrow5') > One thing I want to point out that the underlying Rscript process under > Rstudio seems to definitely use more than one core when executing the above. > *Edit2* > Using arrow::set_cpu_count(1) seems to solve the issue. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15730) [R] Memory usage in R blows up
[ https://issues.apache.org/jira/browse/ARROW-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15730: -- Affects Version/s: 6.0.1 > [R] Memory usage in R blows up > -- > > Key: ARROW-15730 > URL: https://issues.apache.org/jira/browse/ARROW-15730 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 6.0.1 >Reporter: Christian >Assignee: Will Jones >Priority: Major > Fix For: 6.0.1 > > Attachments: image-2022-02-19-09-05-32-278.png > > > Hi, > I'm trying to load a ~10gb arrow file into R (under Windows) > _(The file is generated in the 6.0.1 arrow version under Linux)._ > For whatever reason the memory usage blows up to ~110-120gb (in a fresh and > empty R instance). > The weird thing is that when deleting the object again and running a gc() the > memory usage goes down to 90gb only. The delta of ~20-30gb is what I would > have expected the dataframe to use up in memory (and that's also approx. what > was used - in total during the load - when running the old arrow version of > 0.15.1. And it is also what R shows me when just printing the object size.) > The commands I'm running are simply: > options(arrow.use_threads=FALSE); > arrow::set_cpu_count(1); # need this - otherwise it freezes under windows > arrow::read_arrow('file.arrow5') > Is arrow reserving some resources in the background and not giving them up > again? Are there some settings I need to change for this? > Is this something that is known and fixed in a newer version? > *Note* that this doesn't happen in Linux. There all the resources are freed > up when calling the gc() function - not sure if it matters but there I also > don't need to set the cpu count to 1. > Any help would be appreciated. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15748: --- Labels: pull-request-available (was: ) > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Assignee: Rok Mihevc >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.1, 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15779) [Python] Create python bindings for Substrait consumer
Weston Pace created ARROW-15779: --- Summary: [Python] Create python bindings for Substrait consumer Key: ARROW-15779 URL: https://issues.apache.org/jira/browse/ARROW-15779 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Weston Pace We will want to figure out how to expose the Substrait consumer to python. This could be a single method that accepts a buffer of bytes and returns an iterator of record batches but we might also want a helper method that returns a table. I'm thinking this would go in the compute namespace. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15736) [C++] Aggregate functions for min and max index.
[ https://issues.apache.org/jira/browse/ARROW-15736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497519#comment-17497519 ] Joris Van den Bossche commented on ARROW-15736: --- For reference, we already have a "argsort" kernel ({{sort_to_indices}}, from ARROW-1566, later renamed in ARROW-6232) > [C++] Aggregate functions for min and max index. > > > Key: ARROW-15736 > URL: https://issues.apache.org/jira/browse/ARROW-15736 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: A. Coady >Priority: Major > Labels: kernel > > Numpy and Pandas both have `argmin` and `argmax`, for the common use case of > finding values in parallel arrays which correspond to min or max values. > Proposals: > * `min_max_index` for arrays > * `hash_min_max_index` for aggregations > * some ability to break ties: > ** `min_max_index` for tables with multiple sort keys, similar to > `sort_indices` > ** `min_max_indices` for arrays to match all equal values -- This message was sent by Atlassian Jira (v8.20.1#820001)
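The proposed semantics can be sketched in plain Python (the function and key names are hypothetical, mirroring the proposal rather than any existing kernel): return the indices of the minimum and maximum valid values, skipping nulls, with a defined tie-breaking rule.

```python
def min_max_index(values):
    """Hypothetical sketch of the proposed kernel: indices of the min
    and max values, skipping None (null) entries."""
    valid = [(v, i) for i, v in enumerate(values) if v is not None]
    if not valid:
        return {"min_index": None, "max_index": None}
    # tuple comparison breaks ties: first occurrence wins for min,
    # last occurrence wins for max
    return {"min_index": min(valid)[1], "max_index": max(valid)[1]}


result = min_max_index([3, None, 1, 5, 1])
# -> {"min_index": 2, "max_index": 3}
```

A real kernel would need the configurable tie-breaking the ticket proposes (e.g. returning all matching indices); this sketch just fixes one arbitrary rule.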
[jira] [Commented] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497513#comment-17497513 ] Rok Mihevc commented on ARROW-15748: Thanks for the analysis and suggestion Joris! I'll do that and add a python test. > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Assignee: Rok Mihevc >Priority: Minor > Fix For: 7.0.1, 8.0.0 > > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15281) [C++] Implement ability to retrieve fragment filename
[ https://issues.apache.org/jira/browse/ARROW-15281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sanjiban Sengupta reassigned ARROW-15281: - Assignee: Sanjiban Sengupta > [C++] Implement ability to retrieve fragment filename > - > > Key: ARROW-15281 > URL: https://issues.apache.org/jira/browse/ARROW-15281 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Assignee: Sanjiban Sengupta >Priority: Major > Labels: dataset, query-engine > > A user has requested the ability to include the filename of the CSV in the > dataset output - see discussion on ARROW-15260 for more context. > Relevant info from that ticket: > > "From a C++ perspective we've got many of the pieces needed already. One > challenge is that the datasets API is written to work with "fragments" and > not "files". For example, a dataset might be an in-memory table in which case > we are working with InMemoryFragment and not FileFragment so there is no > concept of "filename". > That being said, the low level ScanBatchesAsync method actually returns a > generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is > a struct with the record batch as well as the source fragment for that record > batch. > So if you were to execute scan, you could inspect the fragment and, if it is > a FileFragment, you could extract the filename. > Another challenge is that R is moving towards more and more access through an > exec plan and not directly using a scanner. In order for that to work we > would need to augment the scan results with the filename in C++ before > sending into the exec plan. Luckily, we already do this a bit as well. We > currently augment the scan results with fragment index, batch index, and > whether the batch is the last batch in the fragment. > Since ExecBatch can work with constants efficiently I don't think there will > be much performance cost in always including the filename. 
So the work > remaining is simply to add a new augmented field __fragment_source_name > which is always attached if the underlying fragment is a filename. Then users > can get this field if they want by including "__fragment_source_name" in > the list of columns they query for." -- This message was sent by Atlassian Jira (v8.20.1#820001)
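The augmentation described above — attaching the source filename to each scanned batch — can be modeled with a small tagging generator. All names below are hypothetical, modeled on the TaggedRecordBatch idea from the ticket, not the actual C++ API:

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class TaggedBatch:
    """Sketch of TaggedRecordBatch: a batch plus its source fragment name."""
    batch: Any
    fragment_source_name: Optional[str]  # None for in-memory fragments


def scan_batches(fragments) -> Iterator[TaggedBatch]:
    """fragments: iterable of (filename_or_None, batches) pairs."""
    for name, batches in fragments:
        for batch in batches:
            # the name is a per-fragment constant, which is why always
            # attaching the filename should cost little
            yield TaggedBatch(batch, name)


tagged = list(scan_batches([("part-0.csv", ["b0", "b1"]), (None, ["b2"])]))
```

In the real design the constant would ride along as an augmented column, so downstream exec-plan nodes see it like any other field.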
[jira] [Commented] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_start when rounding to multiple of week
[ https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497508#comment-17497508 ] Dragoș Moldovan-Grünfeld commented on ARROW-15680: -- [~apitrou] this is the issue I mentioned on the Labs call > [C++] Temporal floor/ceil/round should accept week_start when rounding to > multiple of week > --- > > Key: ARROW-15680 > URL: https://issues.apache.org/jira/browse/ARROW-15680 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: kernel > > See ARROW-14821 and the [related > PR|https://github.com/apache/arrow/pull/12154]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-5248) [Python] support zoneinfo / dateutil timezones
[ https://issues.apache.org/jira/browse/ARROW-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-5248. -- Resolution: Fixed Issue resolved by pull request 12421 [https://github.com/apache/arrow/pull/12421] > [Python] support zoneinfo / dateutil timezones > -- > > Key: ARROW-5248 > URL: https://issues.apache.org/jira/browse/ARROW-5248 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Alenka Frim >Priority: Minor > Labels: beginner, pull-request-available > Fix For: 8.0.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > The {{dateutil}} packages also provides a set of timezone objects > (https://dateutil.readthedocs.io/en/stable/tz.html) in addition to {{pytz}}. > In pyarrow, we only support pytz timezones (and the stdlib datetime.timezone > fixed offset): > {code} > In [2]: import dateutil.tz > > > In [3]: import pyarrow as pa > > > In [5]: pa.timestamp('us', dateutil.tz.gettz('Europe/Brussels')) > > > ... > ~/miniconda3/envs/dev37/lib/python3.7/site-packages/pyarrow/types.pxi in > pyarrow.lib.tzinfo_to_string() > ValueError: Unable to convert timezone > `tzfile('/usr/share/zoneinfo/Europe/Brussels')` to string > {code} > But pandas also supports dateutil timezones. As a consequence, when having a > pandas DataFrame that uses a dateutil timezone, you get an error when > converting to an arrow table. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc reassigned ARROW-15748: -- Assignee: Rok Mihevc > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Assignee: Rok Mihevc >Priority: Minor > Fix For: 7.0.1, 8.0.0 > > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497474#comment-17497474 ] David Li commented on ARROW-15778: -- Ah, I misunderstood then, thanks for clarifying. > [Java] Endianness field not emitted in IPC stream > - > > Key: ARROW-15778 > URL: https://issues.apache.org/jira/browse/ARROW-15778 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > It seems the Java IPC writer implementation does not emit the Endianness > information at all (making it Little by default). This complicates > interoperability with the C++ IPC reader, which does read this information > and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497472#comment-17497472 ] Antoine Pitrou commented on ARROW-15778: I'm not sure any form of negotiation is needed? The way it works at the IPC level is that the writer emits data in whichever endianness it chooses (also setting the corresponding metadata field to the appropriate value) and the reader decides to byte-swap the data if required. So it would work similarly at the Flight level. > [Java] Endianness field not emitted in IPC stream > - > > Key: ARROW-15778 > URL: https://issues.apache.org/jira/browse/ARROW-15778 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > It seems the Java IPC writer implementation does not emit the Endianness > information at all (making it Little by default). This complicates > interoperability with the C++ IPC reader, which does read this information > and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
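The reader-side rule Antoine describes — decode according to the endianness the stream's metadata declares, byte-swapping to native order as needed — can be sketched with the stdlib {{struct}} module (function and field names are illustrative, not Arrow's API):

```python
import struct


def decode_int64s(payload: bytes, declared_endianness: str) -> list:
    """Decode int64 values written in the endianness declared by the
    stream's metadata; struct converts to native order as needed."""
    prefix = "<" if declared_endianness == "little" else ">"
    count = len(payload) // 8
    return list(struct.unpack(f"{prefix}{count}q", payload))


# A big-endian writer emits [1, 2]; any reader recovers the same values,
# regardless of its own host order, because the metadata says "big".
big_payload = struct.pack(">2q", 1, 2)
values = decode_int64s(big_payload, "big")  # -> [1, 2]
```

The bug here is exactly a wrong declaration: if the Java writer omits the field, a C++ reader assumes "little" and — as in {{decode_int64s(big_payload, "little")}} — gets byte-swapped garbage.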
[jira] [Comment Edited] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497471#comment-17497471 ] Joris Van den Bossche edited comment on ARROW-15748 at 2/24/22, 3:26 PM: - [~coady] thanks for the report! The link you provide for the actual behaviour points to the C++ docs, and while that indeed uses "day", the bindings in Python _do_ use "second": https://github.com/apache/arrow/blob/094c5ba186cddd69d4aa83de5ed2b62d4ed07081/python/pyarrow/_compute.pyx#L892 Now, the confusing part is that this class is not instantiated (I assume) if no options are used at all, and in that case it uses the defaults from C++. You can see this in the following example: {code:python} >>> arr = pa.array([pd.Timestamp("2012-01-01 09:01:02.123456")]) >>> import pyarrow.compute as pc >>> pc.round_temporal(arr)# <--- indeed uses "day" by default [ 2012-01-01 00:00:00.00 ] >>> pc.round_temporal(arr, unit="second")# <--- manually specifying >>> "second" still works [ 2012-01-01 09:01:02.00 ] >>> pc.round_temporal(arr, multiple=5)# <--- but when specifying a >>> different option, it now actually defaults to "second" ... [ 2012-01-01 09:01:00.00 ] {code} Now, long story short, the simple conclusion is of course still that we should align the defaults in C++ and Python was (Author: jorisvandenbossche): The link you provide for the actual behaviour points to the C++ docs, and while that indeed uses "day", the bindings in Python _do_ use "second": https://github.com/apache/arrow/blob/094c5ba186cddd69d4aa83de5ed2b62d4ed07081/python/pyarrow/_compute.pyx#L892 Now, the confusing part is that this class is not instantiated (I assume) if no options are used at all, and in that case it uses the defaults from C++. 
You can see this in the following example: {code:python} >>> arr = pa.array([pd.Timestamp("2012-01-01 09:01:02.123456")]) >>> import pyarrow.compute as pc >>> pc.round_temporal(arr)# <--- indeed uses "day" by default [ 2012-01-01 00:00:00.00 ] >>> pc.round_temporal(arr, unit="second")# <--- manually specifying >>> "second" still works [ 2012-01-01 09:01:02.00 ] >>> pc.round_temporal(arr, multiple=5)# <--- but when specifying a >>> different option, it now actually defaults to "second" ... [ 2012-01-01 09:01:00.00 ] {code} Now, long story short, the simple conclusion is of course still that we should align the defaults in C++ and Python > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Priority: Minor > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15748: -- Fix Version/s: 7.0.1 8.0.0 > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Priority: Minor > Fix For: 7.0.1, 8.0.0 > > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497471#comment-17497471 ] Joris Van den Bossche commented on ARROW-15748: --- The link you provide for the actual behaviour points to the C++ docs, and while that indeed uses "day", the bindings in Python _do_ use "second": https://github.com/apache/arrow/blob/094c5ba186cddd69d4aa83de5ed2b62d4ed07081/python/pyarrow/_compute.pyx#L892 Now, the confusing part is that this class is not instantiated (I assume) if no options are used at all, and in that case it uses the defaults from C++. You can see this in the following example: {code:python} >>> arr = pa.array([pd.Timestamp("2012-01-01 09:01:02.123456")]) >>> import pyarrow.compute as pc >>> pc.round_temporal(arr)# <--- indeed uses "day" by default [ 2012-01-01 00:00:00.00 ] >>> pc.round_temporal(arr, unit="second")# <--- manually specifying >>> "second" still works [ 2012-01-01 09:01:02.00 ] >>> pc.round_temporal(arr, multiple=5)# <--- but when specifying a >>> different option, it now actually defaults to "second" ... [ 2012-01-01 09:01:00.00 ] {code} Now, long story short, the simple conclusion is of course still that we should align the defaults in C++ and Python > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Priority: Minor > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
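Joris's diagnosis — the Python-side default only applies when an options object is actually constructed — can be modeled in a few lines of plain Python. The defaults mirror the report; the dispatch logic is a hypothetical sketch of the behavior, not pyarrow's internals:

```python
CPP_DEFAULT_UNIT = "day"  # default baked into the C++ kernel


class RoundTemporalOptions:
    """Sketch of the Python binding, whose own default is 'second'."""

    def __init__(self, multiple=1, unit="second"):
        self.multiple = multiple
        self.unit = unit


def effective_unit(**kwargs):
    """Which unit actually applies for a round_temporal-style call."""
    if kwargs:
        # any explicit option instantiates the Python options class,
        # so its 'second' default kicks in for unspecified fields
        return RoundTemporalOptions(**kwargs).unit
    # no options passed: the binding never builds an options object,
    # and the C++ default applies instead
    return CPP_DEFAULT_UNIT
```

So {{effective_unit()}} yields "day" while {{effective_unit(multiple=5)}} yields "second" — the same surprising split as in the {{pc.round_temporal}} examples above, and the reason the two defaults need to be aligned.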
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497463#comment-17497463 ] David Li commented on ARROW-15778: -- Flight doesn't do any endianness detection/negotiation anyways (it expects producer/consumer to set appropriate options) though we should eventually fix that. > [Java] Endianness field not emitted in IPC stream > - > > Key: ARROW-15778 > URL: https://issues.apache.org/jira/browse/ARROW-15778 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > It seems the Java IPC writer implementation does not emit the Endianness > information at all (making it Little by default). This complicates > interoperability with the C++ IPC reader, which does read this information > and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-15645) [Flight][Java][C++] Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497458#comment-17497458 ] Antoine Pitrou edited comment on ARROW-15645 at 2/24/22, 3:12 PM: -- If my diagnosis above is correct, then this is really caused by ARROW-15778. You could work it around by disabling endianness conversion on the Flight client side, but unfortunately that option is not exposed in Python (see ARROW-15777). was (Author: pitrou): If my diagnosis above is correct, then this is really caused by ARROW-15778. You could work it around by disable endianness conversion on the Flight client side, but unfortunately that is not exposed in Python (see ARROW-15777). > [Flight][Java][C++] Data read through Flight is having endianness issue on > s390x > > > Key: ARROW-15645 > URL: https://issues.apache.org/jira/browse/ARROW-15645 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC, Java >Affects Versions: 5.0.0 > Environment: Linux s390x (big endian) >Reporter: Ravi Gummadi >Priority: Major > > Am facing an endianness issue on s390x(big endian) when converting the data > read through flight to pandas data frame. > (1) table.validate() fails with error > {code} > Traceback (most recent call last): > File "/tmp/2.py", line 51, in > table.validate() > File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in > binary array > {code} > (2) table.to_pandas() gives a segmentation fault > > Here is a sample code that I am using: > {code:python} > from pyarrow import flight > import os > import json > flight_endpoint = os.environ.get("flight_server_url", > "grpc+tls://...local:443") > print(flight_endpoint) > # > class TokenClientAuthHandler(flight.ClientAuthHandler): > """An example implementation of authentication via handshake. 
> With the default constructor, the user token is read from the > environment: TokenClientAuthHandler(). > You can also pass a user token as parameter to the constructor, > TokenClientAuthHandler(yourtoken). > """ > def __init__(self, token: str = None): > super().__init__() > if token is not None: > strToken = 'Bearer {}'.format(token) > else: > strToken = 'Bearer {}'.format(os.environ.get("some_auth_token")) > self.token = strToken.encode('utf-8') > #print(self.token) > def authenticate(self, outgoing, incoming): > outgoing.write(self.token) > self.token = incoming.read() > def get_token(self): > return self.token > > readClient = flight.FlightClient(flight_endpoint) > readClient.authenticate(TokenClientAuthHandler()) > cmd = json.dumps({...}) > descriptor = flight.FlightDescriptor.for_command(cmd) > flightInfo = readClient.get_flight_info(descriptor) > reader = readClient.do_get(flightInfo.endpoints[0].ticket) > table = reader.read_all() > print(table) > print(table.num_columns) > print(table.num_rows) > table.validate() > table.to_pandas() > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
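The "Negative offsets in binary array" error in the traceback above is what a wrongly applied byte-swap looks like. The following is a minimal stdlib-only sketch (hypothetical offsets, not Arrow's actual IPC code): int32 offsets written on a big-endian machine but decoded as little-endian, because the stream was mislabelled.

```python
import struct

# Hypothetical int32 offsets of a binary array, written on a big-endian
# machine but labelled little-endian in the stream (the mislabelling
# diagnosed in ARROW-15778).
offsets = [0, 130, 260, 390]
payload = struct.pack(">4i", *offsets)   # big-endian bytes on the wire

# A reader that trusts the "little" label decodes the wrong way around:
misread = struct.unpack("<4i", payload)
print(misread)

# The decoded values are garbage, and some come out negative --
# exactly the validation failure reported above.
assert misread != tuple(offsets)
assert any(v < 0 for v in misread)
```

Any offset whose low-order byte has the high bit set (e.g. 130 = 0x82) turns into a negative int32 after the spurious swap, which is why validation flags negative offsets rather than merely wrong ones.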
[jira] [Commented] (ARROW-15757) [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior
[ https://issues.apache.org/jira/browse/ARROW-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497462#comment-17497462 ] Joris Van den Bossche commented on ARROW-15757: --- Indeed, we should probably ensure users can pass that keyword in write_to_dataset as well. Currently, the {{**kwargs}} are passed to the ParquetFileFormat write options (for parquet specific write options). Thanks for raising the issue! > [Python] Missing bindings for existing_data_behavior makes it impossible to > maintain old behavior > -- > > Key: ARROW-15757 > URL: https://issues.apache.org/jira/browse/ARROW-15757 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet, Python >Affects Versions: 7.0.0 >Reporter: christophe bagot >Priority: Major > > Shouldn't the missing bindings reported earlier in > [https://github.com/apache/arrow/pull/11632] be propagated higher up [here in > the parquet.py > module|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L2217]? > Passing **kwargs as is the case for {{write_table}} would do the trick I > think. > I am finding myself stuck while using pandas.to_parquet with > {{use_legacy_dataset=false}} and no way to set the {{existing_data_behavior}} > flag to {{overwrite_or_ignore}} > -- This message was sent by Atlassian Jira (v8.20.1#820001)
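The fix Joris describes, forwarding a dataset-level keyword instead of folding every keyword into the Parquet write options, amounts to plain keyword plumbing. A sketch with hypothetical stand-in functions (not the real pyarrow signatures):

```python
def write_dataset(table, path, *, existing_data_behavior="error", **format_options):
    # Stand-in for pyarrow.dataset.write_dataset: accepts the
    # dataset-level keyword plus format-specific options.
    return {"behavior": existing_data_behavior, "format": format_options}

def write_to_dataset(table, path, existing_data_behavior=None, **kwargs):
    # Proposed shape of the fix: pass the dataset-level keyword through
    # explicitly, while the remaining **kwargs stay format-specific.
    dataset_kwargs = {}
    if existing_data_behavior is not None:
        dataset_kwargs["existing_data_behavior"] = existing_data_behavior
    return write_dataset(table, path, **dataset_kwargs, **kwargs)

result = write_to_dataset(None, "/tmp/x",
                          existing_data_behavior="overwrite_or_ignore",
                          compression="snappy")
assert result["behavior"] == "overwrite_or_ignore"
assert result["format"] == {"compression": "snappy"}
```

The point of separating the two kinds of keywords is that `existing_data_behavior` controls the dataset writer itself, not the file format, so it cannot live in the ParquetFileFormat write options.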
[jira] [Commented] (ARROW-15645) [Flight][Java][C++] Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497458#comment-17497458 ] Antoine Pitrou commented on ARROW-15645: If my diagnosis above is correct, then this is really caused by ARROW-15778. You could work it around by disabling endianness conversion on the Flight client side, but unfortunately that option is not exposed in Python (see ARROW-15777). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497459#comment-17497459 ] Antoine Pitrou commented on ARROW-15778: Also cc [~lidavidm] since this affects Flight. > [Java] Endianness field not emitted in IPC stream > - > > Key: ARROW-15778 > URL: https://issues.apache.org/jira/browse/ARROW-15778 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > It seems the Java IPC writer implementation does not emit the Endianness > information at all (making it Little by default). This complicates > interoperability with the C++ IPC reader, which does read this information > and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15776) [Python] Expose IpcReadOptions
[ https://issues.apache.org/jira/browse/ARROW-15776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497460#comment-17497460 ] Antoine Pitrou commented on ARROW-15776: cc [~alenkaf] [~jorisvandenbossche] > [Python] Expose IpcReadOptions > -- > > Key: ARROW-15776 > URL: https://issues.apache.org/jira/browse/ARROW-15776 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > {{IpcWriteOptions}} is exposed in Python but {{IpcReadOptions}} is not. The > latter is necessary to change endian conversion behaviour. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15757) [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior
[ https://issues.apache.org/jira/browse/ARROW-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15757: -- Summary: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior (was: Missing bindings for existing_data_behavior makes it impossible to maintain old behavior ) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497456#comment-17497456 ] Antoine Pitrou commented on ARROW-15778: The offending code seems to be there: https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java#L202-L213 This seems reasonably easy to fix (perhaps a one-line fix, though a test should ideally be added as well). [~emkornfield] [~kiszk] -- This message was sent by Atlassian Jira (v8.20.1#820001)
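The interoperability failure caused by the missing endianness field reduces to a two-line decision rule. The sketch below is illustrative stdlib Python, not the actual Java or C++ implementation:

```python
import sys

def reader_must_swap(declared: str, native: str) -> bool:
    # An IPC reader byte-swaps exactly when the stream's declared
    # endianness differs from the reader's native endianness.
    return declared != native

# A correct big-endian writer declares "big"; a big-endian reader
# then leaves the data alone.
assert not reader_must_swap(declared="big", native="big")

# The ARROW-15778 bug: the Java writer always declares "little", so a
# big-endian C++ reader swaps data that is already in native byte order.
assert reader_must_swap(declared="little", native="big")

# This interpreter's own native byte order, for reference.
print(sys.byteorder)
```

Because the declared field is the reader's only signal, a writer that hardcodes it defeats the whole scheme, which is why fixing the writer (rather than second-guessing on the reader side) is the clean resolution.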
[jira] [Created] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
Antoine Pitrou created ARROW-15778: -- Summary: [Java] Endianness field not emitted in IPC stream Key: ARROW-15778 URL: https://issues.apache.org/jira/browse/ARROW-15778 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Antoine Pitrou Fix For: 8.0.0 It seems the Java IPC writer implementation does not emit the Endianness information at all (making it Little by default). This complicates interoperability with the C++ IPC reader, which does read this information and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-3476) [Java] mvn test in memory fails on a big-endian platform
[ https://issues.apache.org/jira/browse/ARROW-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497455#comment-17497455 ] Antoine Pitrou commented on ARROW-3476: --- [~kiszk] Does this issue need to remain open? > [Java] mvn test in memory fails on a big-endian platform > > > Key: ARROW-3476 > URL: https://issues.apache.org/jira/browse/ARROW-3476 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Kazuaki Ishizaki >Priority: Major > > Apache Arrow is becoming commonplace to exchange data among important > emerging analytics frameworks such as Pandas, Numpy, and Spark. > [IBM Z|https://en.wikipedia.org/wiki/IBM_Z] is one of platforms to process > critical transactions such as bank or credit card. Users of IBM Z want to > extract insights from these transactions using the emerging analytics systems > on IBM Z Linux. These analytics pipelines can be also fast and effective on > IBM Z Linux by using Apache Arrow on memory. > From the technical perspective, since IBM Z Linux uses big-endian data > format, it is not possible to use Apache Arrow in this pipeline. If Apache > Arrow could support big-endian, the use case would be expanded. > When I ran test case of Apache arrow on a big-endian platform (ppc64be), > {{mvn test}} in memory causes a failure due to an assertion. > In {{TestEndianess.testLittleEndian}} test suite, the assertion occurs during > an allocation of a {{RootAllocator}} class. > {code} > $ uname -a > Linux ppc64be.novalocal 4.5.7-300.fc24.ppc64 #1 SMP Fri Jun 10 20:29:32 UTC > 2016 ppc64 ppc64 ppc64 GNU/Linux > $ arch > ppc64 > $ cd java/memory > $ mvn test > [INFO] Scanning for projects... > [INFO] > > [INFO] > > [INFO] Building Arrow Memory 0.12.0-SNAPSHOT > [INFO] > > [INFO] > ... 
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.082 > s - in org.apache.arrow.memory.TestAccountant > [INFO] Running org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 > s - in org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Running org.apache.arrow.memory.TestBaseAllocator > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.746 > s <<< FAILURE! - in org.apache.arrow.memory.TestEndianess > [ERROR] testLittleEndian(org.apache.arrow.memory.TestEndianess) Time > elapsed: 0.313 s <<< ERROR! > java.lang.ExceptionInInitializerError > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > Caused by: java.lang.IllegalStateException: Arrow only runs on LittleEndian > systems. > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > [ERROR] Tests run: 22, Failures: 0, Errors: 21, Skipped: 1, Time elapsed: > 0.055 s <<< FAILURE! - in org.apache.arrow.memory.TestBaseAllocator > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (ARROW-15563) [C++] Compilation failure on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-15563. -- Resolution: Done [~mr.chandureddy] I suggest you follow ARROW-15645 for updates. Thank you for this report! > [C++] Compilation failure on s390x platform > --- > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major > > > {code:java} > (pyarrow-dev) [root@s390x]# cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_BUILD_TYPE=debug > -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON > -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON > -DARROW_WITH_BROTLI=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON > -DARROW_BUILD_TESTS=ON .. > -- Building using CMake version: 3.22.2 > -- The C compiler identification is GNU 9.2.1 > -- The CXX compiler identification is GNU 8.5.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: /usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: /usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Arrow version: 3.0.0 (full: '3.0.0') > -- Arrow SO version: 300 (full: 300.0.0) > -- clang-tidy not found > -- clang-format not found > -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) > -- infer not found > -- Found Python3: /usr/bin/python3.9 (found version "3.9.6") found > components: Interpreter > -- Found cpplint executable at > /root/git/repos/arrow/cpp/build-support/cpplint.py > -- System processor: s390x > -- Arrow build warning level: CHECKIN > Using ld linker > Configured for 
DEBUG build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: DEBUG > -- Using AUTO approach to find dependencies > -- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862 > -- ARROW_AWSSDK_BUILD_VERSION: 1.8.90 > -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.5 > -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.4.59 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5 > -- ARROW_BOOST_BUILD_VERSION: 1.71.0 > -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 > -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 > -- ARROW_CARES_BUILD_VERSION: 1.16.1 > -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.2 > -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2 > -- ARROW_GLOG_BUILD_VERSION: v0.4.0 > -- ARROW_GRPC_BUILD_VERSION: v1.33.2 > -- ARROW_GTEST_BUILD_VERSION: 1.10.0 > -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 > -- ARROW_LZ4_BUILD_VERSION: v1.9.2 > -- ARROW_MIMALLOC_BUILD_VERSION: v1.6.4 > -- ARROW_ORC_BUILD_VERSION: 1.6.2 > -- ARROW_PROTOBUF_BUILD_VERSION: v3.13.0 > -- ARROW_RAPIDJSON_BUILD_VERSION: 1a803826f1197b5e30703afe4b9c0e7dd48074f5 > -- ARROW_RE2_BUILD_VERSION: 2019-08-01 > -- ARROW_SNAPPY_BUILD_VERSION: 1.1.8 > -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 > -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 > -- ARROW_UTF8PROC_BUILD_VERSION: v2.5.0 > -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 > -- ARROW_ZSTD_BUILD_VERSION: v1.4.5 > -- Looking for pthread.h > -- Looking for pthread.h - found > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed > -- Check if compiler accepts -pthread > -- Check if compiler accepts -pthread - yes > -- Found Threads: TRUE > -- Checking for module 'thrift' > -- Package 'thrift', required by 'virtual:world', not found > -- Could NOT find Thrift: Found unsuitable version "", but required is at > least "0.11.0" (found THRIFT_LIB-NOTFOUND) > -- Looking for __SIZEOF_INT128__ > -- Looking for __SIZEOF_INT128__ - found > -- Found Boost: /usr/include (found suitable version "1.66.0", minimum > required is "1.58") 
found components: regex system filesystem > -- Boost include dir: /usr/include > -- Boost libraries: Boost::system;Boost::filesystem > -- Could NOT find Snappy (missing: Snappy_LIB Snappy_INCLUDE_DIR) > -- Building snappy from source > -- Checking for modules 'libbrotlicommon;libbrotlienc;libbrotlidec' > -- Package 'libbrotlicommon', required by 'virtual:world', not found > -- Package 'libbrotlienc', required by 'virtual:world', not found > -- Package 'libbrotlidec', required by 'virtual:world', not found > -- Could NOT find Brotli (missing: BROTLI_COMMON_LIBRARY BROTLI_ENC_LIBRARY > BROTLI_DEC_LIBRARY BROTLI_INCLUDE_DIR) > -- Building brotli from source > -- Building without OpenSSL support. Minimum OpenSSL vers
[jira] [Updated] (ARROW-15645) Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15645: --- Component/s: Java (was: Python) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15645) [Flight][Java][C++] Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15645: --- Summary: [Flight][Java][C++] Data read through Flight is having endianness issue on s390x (was: Data read through Flight is having endianness issue on s390x) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497452#comment-17497452 ] Antoine Pitrou commented on ARROW-15645: Ok, so my guess is that both server (Java) and client (Python/C++) are on s390x, right? On Arrow C++ 3.0.0, no conversion happens in either Java or C++, and it works since client and server have the same endianness (both big endian). On Arrow C++ 4.0.0+, the Flight client reads the endianness information from the IPC stream. If the machine endianness doesn't match the stream endianness, endianness conversion is attempted by default. Here is the problem: Arrow Java (and the Java Flight server) seems to always set the endianness information to "little" (even on a big endian machine). Arrow C++ interprets that information as meaning a conversion is needed, while the data is already in the right format. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15777) [Python][Flight] Allow passing IpcReadOptions to FlightCallOptions
[ https://issues.apache.org/jira/browse/ARROW-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15777: --- Description: Once {{IpcReadOptions}} is exposed in Python (ARROW-15776), it should also be accepted as an optional parameter to {{FlightCallOptions}}. (was: Once {{IpcReadOptions}} is exposed in Python, it should also be accepted as an optional parameter to {{FlightCallOptions}}.) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15777) [Python][Flight] Allow passing IpcReadOptions to FlightCallOptions
Antoine Pitrou created ARROW-15777: -- Summary: [Python][Flight] Allow passing IpcReadOptions to FlightCallOptions Key: ARROW-15777 URL: https://issues.apache.org/jira/browse/ARROW-15777 Project: Apache Arrow Issue Type: Improvement Components: FlightRPC, Python Reporter: Antoine Pitrou Fix For: 8.0.0 Once {{IpcReadOptions}} is exposed in Python, it should also be accepted as an optional parameter to {{FlightCallOptions}}. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15776) [Python] Expose IpcReadOptions
[ https://issues.apache.org/jira/browse/ARROW-15776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15776: --- Fix Version/s: 8.0.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15776) [Python] Expose IpcReadOptions
Antoine Pitrou created ARROW-15776: -- Summary: [Python] Expose IpcReadOptions Key: ARROW-15776 URL: https://issues.apache.org/jira/browse/ARROW-15776 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Antoine Pitrou {{IpcWriteOptions}} is exposed in Python but {{IpcReadOptions}} is not. The latter is necessary to change endian conversion behaviour. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497439#comment-17497439 ] Antoine Pitrou commented on ARROW-15645: Is the client or the server on s390x? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15645) Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15645: --- Description: Am facing an endianness issue on s390x(big endian) when converting the data read through flight to pandas data frame. (1) table.validate() fails with error {code} Traceback (most recent call last): File "/tmp/2.py", line 51, in table.validate() File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in binary array {code} (2) table.to_pandas() gives a segmentation fault Here is a sample code that I am using: {code:python} from pyarrow import flight import os import json flight_endpoint = os.environ.get("flight_server_url", "grpc+tls://...local:443") print(flight_endpoint) # class TokenClientAuthHandler(flight.ClientAuthHandler): """An example implementation of authentication via handshake. With the default constructor, the user token is read from the environment: TokenClientAuthHandler(). You can also pass a user token as parameter to the constructor, TokenClientAuthHandler(yourtoken). 
""" def \_\_init\_\_(self, token: str = None): super().\_\_init\__() if( token != None): strToken = strToken = 'Bearer {}'.format(token) else: strToken = 'Bearer {}'.format(os.environ.get("some_auth_token")) self.token = strToken.encode('utf-8') #print(self.token) def authenticate(self, outgoing, incoming): outgoing.write(self.token) self.token = incoming.read() def get_token(self): return self.token readClient = flight.FlightClient(flight_endpoint) readClient.authenticate(TokenClientAuthHandler()) cmd = json.dumps(\{...}) descriptor = flight.FlightDescriptor.for_command(cmd) flightInfo = readClient.get_flight_info(descriptor) reader = readClient.do_get(flightInfo.endpoints[0].ticket) table = reader.read_all() print(table) print(table.num_columns) print(table.num_rows) table.validate() table.to_pandas() {code} was: Am facing an endianness issue on s390x(big endian) when converting the data read through flight to pandas data frame. (1) table.validate() fails with error Traceback (most recent call last): File "/tmp/2.py", line 51, in table.validate() File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in binary array (2) table.to_pandas() gives a segmentation fault Here is a sample code that I am using: from pyarrow import flight import os import json flight_endpoint = os.environ.get("flight_server_url", "grpc+tls://...local:443") print(flight_endpoint) # class TokenClientAuthHandler(flight.ClientAuthHandler): """An example implementation of authentication via handshake. With the default constructor, the user token is read from the environment: TokenClientAuthHandler(). You can also pass a user token as parameter to the constructor, TokenClientAuthHandler(yourtoken). 
""" def \_\_init\_\_(self, token: str = None): super().\_\_init\__() if( token != None): strToken = strToken = 'Bearer {}'.format(token) else: strToken = 'Bearer {}'.format(os.environ.get("some_auth_token")) self.token = strToken.encode('utf-8') #print(self.token) def authenticate(self, outgoing, incoming): outgoing.write(self.token) self.token = incoming.read() def get_token(self): return self.token readClient = flight.FlightClient(flight_endpoint) readClient.authenticate(TokenClientAuthHandler()) cmd = json.dumps(\{...}) descriptor = flight.FlightDescriptor.for_command(cmd) flightInfo = readClient.get_flight_info(descriptor) reader = readClient.do_get(flightInfo.endpoints[0].ticket) table = reader.read_all() print(table) print(table.num_columns) print(table.num_rows) table.validate() table.to_pandas() > Data read through Flight is having endianness issue on s390x > > > Key: ARROW-15645 > URL: https://issues.apache.org/jira/browse/ARROW-15645 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC, Python >Affects Versions: 5.0.0 > Environment: Linux s390x (big endian) >Reporter: Ravi Gummadi >Priority: Major > > Am facing an endianness issue on s390x(big endian) when converting the data > read through flight to pandas data frame. > (1) table.validate() fails with error > {code} > Traceback (most recent call last): > File "/tmp/2.py", line 51, in >
[jira] [Updated] (ARROW-15767) [Python] Arrow Table with DenseUnion fails to convert to Python Pandas DataFrame
[ https://issues.apache.org/jira/browse/ARROW-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15767: -- Summary: [Python] Arrow Table with DenseUnion fails to convert to Python Pandas DataFrame (was: [Python] Arrow Table with Nullable DenseUnion fails to convert to Python Pandas DataFrame) > [Python] Arrow Table with DenseUnion fails to convert to Python Pandas > DataFrame > > > Key: ARROW-15767 > URL: https://issues.apache.org/jira/browse/ARROW-15767 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 >Reporter: Ben Baumgold >Priority: Major > Attachments: nothing.arrow > > > A feather file containing column of nullable values errors when converting to > a Pandas DataFrame. It can be read into a pyarrow.Table as follows: > {code:python} > In [1]: import pyarrow.feather as feather > In [2]: t = feather.read_table("nothing.arrow") > In [3]: t > Out[3]: > pyarrow.Table > col: dense_union<: null=0, : int32 not null=1> > child 0, : null > child 1, : int32 not null > > col: [ -- is_valid: all not null -- type_ids: [ > 1, > 1, > 1, > 0 > ] -- value_offsets: [ > 0, > 1, > 2, > 0 > ] -- child 0 type: null > 1 nulls -- child 1 type: int32 > [ > 1, > 2, > 3 > ]] > {code} > But when trying to convert the pyarrow.Table into a Pandas DataFrame, I get > the following error: > {code:python} > In [4]: t.to_pandas() > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 t.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi in > pyarrow.lib._PandasConvertible.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.Table._to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > table_to_blockmanager(options, table, categories, ignore_metadata, > types_mapper) > 787 _check_data_column_metadata_consistency(all_columns) > 788 columns = _deserialize_column_index(table, all_columns, > 
column_indexes) > --> 789 blocks = _table_to_blocks(options, table, categories, > ext_columns_dtypes) > 790 > 791 axes = [columns, index] > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > _table_to_blocks(options, block_table, categories, extension_columns) >1126 # Convert an arrow table to Block from the internal pandas API >1127 columns = block_table.column_names > -> 1128 result = pa.lib.table_to_blocks(options, block_table, categories, >1129 list(extension_columns.keys())) >1130 return [_reconstruct_block(item, columns, extension_columns) > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.table_to_blocks() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi in > pyarrow.lib.check_status() > ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of > type dense_union<: null=0, : int32 not null=1> is known. > {code} > Note the Arrow file is valid and can be read successfully by > [Arrow.jl|https://github.com/apache/arrow-julia]. A related issue is > [arrow-julia#285|https://github.com/apache/arrow-julia/issues/285]. The > [^nothing.arrow] file used in this example is attached for convenience. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15767) [Python] Arrow Table with Nullable DenseUnion fails to convert to Python Pandas DataFrame
[ https://issues.apache.org/jira/browse/ARROW-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497425#comment-17497425 ] Joris Van den Bossche commented on ARROW-15767: --- There is nothing wrong with your file (it is indeed valid, as it can also be read by pyarrow into a pyarrow.Table), but as the error type indicates: this conversion is just not yet implemented. Specifically for the union types, there are not yet many utilities implemented for interacting with this kind of data on the Python (numpy, pandas) <-> Arrow interaction layer. For example, converting a Python structure to a union array is also not yet implemented (for this I found ARROW-2774). For the missing conversion to pandas, I didn't directly find an issue. For conversion to Python, only a conversion to a plain python list is supported: {code} >>> t["col"].to_pylist() [1, 2, 3, None] {code} In general, we could convert an arrow union type to an object dtype array in numpy/pandas, but that might also not always be very useful. > [Python] Arrow Table with Nullable DenseUnion fails to convert to Python > Pandas DataFrame > - > > Key: ARROW-15767 > URL: https://issues.apache.org/jira/browse/ARROW-15767 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 >Reporter: Ben Baumgold >Priority: Major > Attachments: nothing.arrow > > > A feather file containing a column of nullable values errors when converting to > a Pandas DataFrame. 
It can be read into a pyarrow.Table as follows: > {code:python} > In [1]: import pyarrow.feather as feather > In [2]: t = feather.read_table("nothing.arrow") > In [3]: t > Out[3]: > pyarrow.Table > col: dense_union<: null=0, : int32 not null=1> > child 0, : null > child 1, : int32 not null > > col: [ -- is_valid: all not null -- type_ids: [ > 1, > 1, > 1, > 0 > ] -- value_offsets: [ > 0, > 1, > 2, > 0 > ] -- child 0 type: null > 1 nulls -- child 1 type: int32 > [ > 1, > 2, > 3 > ]] > {code} > But when trying to convert the pyarrow.Table into a Pandas DataFrame, I get > the following error: > {code:python} > In [4]: t.to_pandas() > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 t.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi in > pyarrow.lib._PandasConvertible.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.Table._to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > table_to_blockmanager(options, table, categories, ignore_metadata, > types_mapper) > 787 _check_data_column_metadata_consistency(all_columns) > 788 columns = _deserialize_column_index(table, all_columns, > column_indexes) > --> 789 blocks = _table_to_blocks(options, table, categories, > ext_columns_dtypes) > 790 > 791 axes = [columns, index] > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > _table_to_blocks(options, block_table, categories, extension_columns) >1126 # Convert an arrow table to Block from the internal pandas API >1127 columns = block_table.column_names > -> 1128 result = pa.lib.table_to_blocks(options, block_table, categories, >1129 list(extension_columns.keys())) >1130 return [_reconstruct_block(item, columns, extension_columns) > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.table_to_blocks() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi in > pyarrow.lib.check_status() > 
ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of > type dense_union<: null=0, : int32 not null=1> is known. > {code} > Note the Arrow file is valid and can be read successfully by > [Arrow.jl|https://github.com/apache/arrow-julia]. A related issue is > [arrow-julia#285|https://github.com/apache/arrow-julia/issues/285]. The > [^nothing.arrow] file used in this example is attached for convenience. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15767) [Python] Arrow Table with Nullable DenseUnion fails to convert to Python Pandas DataFrame
[ https://issues.apache.org/jira/browse/ARROW-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15767: -- Summary: [Python] Arrow Table with Nullable DenseUnion fails to convert to Python Pandas DataFrame (was: Arrow Table with Nullable DenseUnion fails to convert to Python Pandas DataFrame) > [Python] Arrow Table with Nullable DenseUnion fails to convert to Python > Pandas DataFrame > - > > Key: ARROW-15767 > URL: https://issues.apache.org/jira/browse/ARROW-15767 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 >Reporter: Ben Baumgold >Priority: Major > Attachments: nothing.arrow > > > A feather file containing column of nullable values errors when converting to > a Pandas DataFrame. It can be read into a pyarrow.Table as follows: > {code:python} > In [1]: import pyarrow.feather as feather > In [2]: t = feather.read_table("nothing.arrow") > In [3]: t > Out[3]: > pyarrow.Table > col: dense_union<: null=0, : int32 not null=1> > child 0, : null > child 1, : int32 not null > > col: [ -- is_valid: all not null -- type_ids: [ > 1, > 1, > 1, > 0 > ] -- value_offsets: [ > 0, > 1, > 2, > 0 > ] -- child 0 type: null > 1 nulls -- child 1 type: int32 > [ > 1, > 2, > 3 > ]] > {code} > But when trying to convert the pyarrow.Table into a Pandas DataFrame, I get > the following error: > {code:python} > In [4]: t.to_pandas() > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 t.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi in > pyarrow.lib._PandasConvertible.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.Table._to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > table_to_blockmanager(options, table, categories, ignore_metadata, > types_mapper) > 787 _check_data_column_metadata_consistency(all_columns) > 788 columns = _deserialize_column_index(table, 
all_columns, > column_indexes) > --> 789 blocks = _table_to_blocks(options, table, categories, > ext_columns_dtypes) > 790 > 791 axes = [columns, index] > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > _table_to_blocks(options, block_table, categories, extension_columns) >1126 # Convert an arrow table to Block from the internal pandas API >1127 columns = block_table.column_names > -> 1128 result = pa.lib.table_to_blocks(options, block_table, categories, >1129 list(extension_columns.keys())) >1130 return [_reconstruct_block(item, columns, extension_columns) > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.table_to_blocks() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi in > pyarrow.lib.check_status() > ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of > type dense_union<: null=0, : int32 not null=1> is known. > {code} > Note the Arrow file is valid and can be read successfully by > [Arrow.jl|https://github.com/apache/arrow-julia]. A related issue is > [arrow-julia#285|https://github.com/apache/arrow-julia/issues/285]. The > [^nothing.arrow] file used in this example is attached for convenience. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15098) [R] Add binding for lubridate::duration() and/or as.difftime()
[ https://issues.apache.org/jira/browse/ARROW-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15098: --- Labels: pull-request-available (was: ) > [R] Add binding for lubridate::duration() and/or as.difftime() > -- > > Key: ARROW-15098 > URL: https://issues.apache.org/jira/browse/ARROW-15098 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Dewey Dunnington >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > After ARROW-14941 we have support for the duration type; however, there is no > binding for {{lubridate::duration()}} or {{as.difftime()}} available in dplyr > evaluation that could create these objects. I'm actually not sure if we > should bind {{lubridate::duration}} since it returns a custom S4 class that's > identical in function to base R's difftime. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15291) [C++][Python] Segfault in StructArray.to_numpy and to_pandas if it contains an ExtensionArray
[ https://issues.apache.org/jira/browse/ARROW-15291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15291: --- Labels: pull-request-available (was: ) > [C++][Python] Segfault in StructArray.to_numpy and to_pandas if it contains > an ExtensionArray > - > > Key: ARROW-15291 > URL: https://issues.apache.org/jira/browse/ARROW-15291 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.1 > Environment: pyarrow 6.0.1, macbook pro >Reporter: quentin lhoest >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Hi ! > If you create a StructArray with an ExtensionArray in it, then both to_numpy > and to_pandas segfault in python: > {code:java} > import pyarrow as pa > class CustomType(pa.PyExtensionType): > def __init__(self): > pa.PyExtensionType.__init__(self, pa.binary()) > def __reduce__(self): > return CustomType, () > arr = pa.ExtensionArray.from_storage(CustomType(), pa.array([b"foo"])) > pa.StructArray.from_arrays([arr], ["name"]).to_numpy(zero_copy_only=False) > {code} > Thanks in advance for the help ! -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14948) [R] Implement lubridate's %m+%, %m-%, add_with_rollback, and addition and subtraction with timestamp
[ https://issues.apache.org/jira/browse/ARROW-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld reassigned ARROW-14948: Assignee: Dragoș Moldovan-Grünfeld > [R] Implement lubridate's %m+%, %m-%, add_with_rollback, and addition and > subtraction with timestamp > > > Key: ARROW-14948 > URL: https://issues.apache.org/jira/browse/ARROW-14948 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15771) [C++][Compute] Add window join to execution engine
[ https://issues.apache.org/jira/browse/ARROW-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-15771: - Labels: query-engine (was: ) > [C++][Compute] Add window join to execution engine > -- > > Key: ARROW-15771 > URL: https://issues.apache.org/jira/browse/ARROW-15771 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: query-engine > > We would want to support window joins with as-of support. > See https://github.com/substrait-io/substrait/issues/3 for more. -- This message was sent by Atlassian Jira (v8.20.1#820001)
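For reference, an as-of join matches each left-side row to the most recent right-side row at or before its timestamp. A minimal sketch of those semantics over sorted Python lists; asof_join is a hypothetical helper name, unrelated to any eventual C++ API:

```python
import bisect

def asof_join(left_times, right_times, right_values):
    """For each left timestamp, take the latest right row with time <= it.

    Assumes right_times is sorted ascending.
    """
    out = []
    for t in left_times:
        i = bisect.bisect_right(right_times, t) - 1  # rightmost time <= t
        out.append(right_values[i] if i >= 0 else None)
    return out

# Trades at t=5/15/25 matched against quotes at t=0/10/20:
print(asof_join([5, 15, 25], [0, 10, 20], ["a", "b", "c"]))  # ['a', 'b', 'c']
```

A left timestamp earlier than every right row gets None, which is the usual "no match yet" behaviour of as-of joins.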
[jira] [Updated] (ARROW-15771) [C++][Compute] Add window join to execution engine
[ https://issues.apache.org/jira/browse/ARROW-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-15771: - Component/s: C++ > [C++][Compute] Add window join to execution engine > -- > > Key: ARROW-15771 > URL: https://issues.apache.org/jira/browse/ARROW-15771 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Priority: Major > > We would want to support window joins with as-of support. > See https://github.com/substrait-io/substrait/issues/3 for more. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14947) [C++] Implement maths with timestamps
[ https://issues.apache.org/jira/browse/ARROW-14947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497371#comment-17497371 ] Rok Mihevc edited comment on ARROW-14947 at 2/24/22, 12:41 PM: --- [~dragosmg] I didn't look into how rollback works, but I'm guessing it's kind of like _add/sub/mul/div_ followed by a _floor_ or _ceil_? If that is the case you could just use the add/sub/mul/div kernels from ARROW-11090 and add _floor/ceil_ for rollback cases. In case there is something still missing in C++ we should identify it and write Jiras. was (Author: rokm): [~dragosmg] I didn't look into how rollback works, but I'm guessing it's kind of like _floor_ and _ceil_? If that is the case you could just use the add/sub kernels from ARROW-11090 and add _floor/ceil_ for rollback cases. In case there is something still missing in C++ we should identify it and write Jiras. > [C++] Implement maths with timestamps > - > > Key: ARROW-14947 > URL: https://issues.apache.org/jira/browse/ARROW-14947 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Nicola Crane >Priority: Major > > Please could we have maths with timestamps implemented? > In order to implement some of the functionality I'd like in R, I need to be > able to do maths with dates. For example: > * Addition and subtraction: Timestamp + Duration = Timestamp (with and > without rollback so have ability to do e.g. 2021-03-30 minus 1 month and > either get a null back, or 2021-02-28), plus the ability to specify whether > to rollback to the first or last, and whether to preserve or reset the time. > See https://lubridate.tidyverse.org/reference/mplus.html for documentation of > the R functionality. > * Multiplying Durations: Duration * Numeric = Duration -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14947) [C++] Implement maths with timestamps
[ https://issues.apache.org/jira/browse/ARROW-14947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497371#comment-17497371 ] Rok Mihevc commented on ARROW-14947: [~dragosmg] I didn't look into how rollback works, but I'm guessing it's kind of like _floor_ and _ceil_? If that is the case you could just use the add/sub kernels from ARROW-11090 and add _floor/ceil_ for rollback cases. In case there is something still missing in C++ we should identify it and write Jiras. > [C++] Implement maths with timestamps > - > > Key: ARROW-14947 > URL: https://issues.apache.org/jira/browse/ARROW-14947 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Nicola Crane >Priority: Major > > Please could we have maths with timestamps implemented? > In order to implement some of the functionality I'd like in R, I need to be > able to do maths with dates. For example: > * Addition and subtraction: Timestamp + Duration = Timestamp (with and > without rollback so have ability to do e.g. 2021-03-30 minus 1 month and > either get a null back, or 2021-02-28), plus the ability to specify whether > to rollback to the first or last, and whether to preserve or reset the time. > See https://lubridate.tidyverse.org/reference/mplus.html for documentation of > the R functionality. > * Multiplying Durations: Duration * Numeric = Duration -- This message was sent by Atlassian Jira (v8.20.1#820001)
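The rollback behaviour requested in the issue (2021-03-30 minus 1 month giving 2021-02-28) can be pinned down with a small stdlib sketch; add_months_with_rollback is a hypothetical helper for illustration, not an Arrow or lubridate API:

```python
import calendar
import datetime

def add_months_with_rollback(ts: datetime.date, months: int) -> datetime.date:
    """Shift a date by whole months, clamping to the last valid day of the
    target month (lubridate's %m+% / add_with_rollback behaviour)."""
    # Zero-based month arithmetic handles year carries in both directions.
    month_index = ts.year * 12 + (ts.month - 1) + months
    year, month = divmod(month_index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return ts.replace(year=year, month=month, day=min(ts.day, last_day))

print(add_months_with_rollback(datetime.date(2021, 3, 30), -1))  # 2021-02-28
```

This matches Rok's framing: a plain month shift followed by a floor to the last valid day whenever the nominal day does not exist in the target month.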
[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497337#comment-17497337 ] Ravi Gummadi commented on ARROW-15645: -- The Flight server side is using Java-based Arrow 6.0.1. On the client side, pyarrow 5.0.0, 6.0.0, and 7.0.0 all hit the reported issue. > Data read through Flight is having endianness issue on s390x > > > Key: ARROW-15645 > URL: https://issues.apache.org/jira/browse/ARROW-15645 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC, Python >Affects Versions: 5.0.0 > Environment: Linux s390x (big endian) >Reporter: Ravi Gummadi >Priority: Major > > Am facing an endianness issue on s390x(big endian) when converting the data > read through flight to pandas data frame. > (1) table.validate() fails with error > Traceback (most recent call last): > File "/tmp/2.py", line 51, in > table.validate() > File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in > binary array > (2) table.to_pandas() gives a segmentation fault > > Here is a sample code that I am using: > from pyarrow import flight > import os > import json > flight_endpoint = os.environ.get("flight_server_url", > "grpc+tls://...local:443") > print(flight_endpoint) > # > class TokenClientAuthHandler(flight.ClientAuthHandler): > """An example implementation of authentication via handshake. > With the default constructor, the user token is read from the > environment: TokenClientAuthHandler(). > You can also pass a user token as parameter to the constructor, > TokenClientAuthHandler(yourtoken). 
> """
>     def __init__(self, token: str = None):
>         super().__init__()
>         if token is not None:
>             strToken = 'Bearer {}'.format(token)
>         else:
>             strToken = 'Bearer {}'.format(os.environ.get("some_auth_token"))
>         self.token = strToken.encode('utf-8')
>         #print(self.token)
>     def authenticate(self, outgoing, incoming):
>         outgoing.write(self.token)
>         self.token = incoming.read()
>     def get_token(self):
>         return self.token
>
> readClient = flight.FlightClient(flight_endpoint)
> readClient.authenticate(TokenClientAuthHandler())
> cmd = json.dumps({...})
> descriptor = flight.FlightDescriptor.for_command(cmd)
> flightInfo = readClient.get_flight_info(descriptor)
> reader = readClient.do_get(flightInfo.endpoints[0].ticket)
> table = reader.read_all()
> print(table)
> print(table.num_columns)
> print(table.num_rows)
> table.validate()
> table.to_pandas()
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15563) [C++] Compilation failure on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497299#comment-17497299 ] Chandra Shekhar Reddy commented on ARROW-15563: --- [~apitrou] and [~adeetikaushal] I was able to build PyArrow 7.0.0 without any issues. Thank you! On the other hand, issue https://issues.apache.org/jira/browse/ARROW-15645 makes me wary of moving to newer PyArrow versions on ZLinux. Please advise. Thank you! > [C++] Compilation failure on s390x platform > --- > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major > > > {code:java} > (pyarrow-dev) [root@s390x]# cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_BUILD_TYPE=debug > -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON > -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON > -DARROW_WITH_BROTLI=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON > -DARROW_BUILD_TESTS=ON .. 
> -- Building using CMake version: 3.22.2 > -- The C compiler identification is GNU 9.2.1 > -- The CXX compiler identification is GNU 8.5.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: /usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: /usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Arrow version: 3.0.0 (full: '3.0.0') > -- Arrow SO version: 300 (full: 300.0.0) > -- clang-tidy not found > -- clang-format not found > -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) > -- infer not found > -- Found Python3: /usr/bin/python3.9 (found version "3.9.6") found > components: Interpreter > -- Found cpplint executable at > /root/git/repos/arrow/cpp/build-support/cpplint.py > -- System processor: s390x > -- Arrow build warning level: CHECKIN > Using ld linker > Configured for DEBUG build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: DEBUG > -- Using AUTO approach to find dependencies > -- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862 > -- ARROW_AWSSDK_BUILD_VERSION: 1.8.90 > -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.5 > -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.4.59 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5 > -- ARROW_BOOST_BUILD_VERSION: 1.71.0 > -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 > -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 > -- ARROW_CARES_BUILD_VERSION: 1.16.1 > -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.2 > -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2 > -- ARROW_GLOG_BUILD_VERSION: v0.4.0 > -- ARROW_GRPC_BUILD_VERSION: v1.33.2 > -- ARROW_GTEST_BUILD_VERSION: 1.10.0 > -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 > -- ARROW_LZ4_BUILD_VERSION: v1.9.2 > -- ARROW_MIMALLOC_BUILD_VERSION: v1.6.4 > -- ARROW_ORC_BUILD_VERSION: 1.6.2 > 
-- ARROW_PROTOBUF_BUILD_VERSION: v3.13.0 > -- ARROW_RAPIDJSON_BUILD_VERSION: 1a803826f1197b5e30703afe4b9c0e7dd48074f5 > -- ARROW_RE2_BUILD_VERSION: 2019-08-01 > -- ARROW_SNAPPY_BUILD_VERSION: 1.1.8 > -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 > -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 > -- ARROW_UTF8PROC_BUILD_VERSION: v2.5.0 > -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 > -- ARROW_ZSTD_BUILD_VERSION: v1.4.5 > -- Looking for pthread.h > -- Looking for pthread.h - found > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed > -- Check if compiler accepts -pthread > -- Check if compiler accepts -pthread - yes > -- Found Threads: TRUE > -- Checking for module 'thrift' > -- Package 'thrift', required by 'virtual:world', not found > -- Could NOT find Thrift: Found unsuitable version "", but required is at > least "0.11.0" (found THRIFT_LIB-NOTFOUND) > -- Looking for __SIZEOF_INT128__ > -- Looking for __SIZEOF_INT128__ - found > -- Found Boost: /usr/include (found suitable version "1.66.0", minimum > required is "1.58") found components: regex system filesystem > -- Boost include dir: /usr/include > -- Boost libraries: Boost::system;Boost::filesystem > -- Could NOT find Snappy (missing: Snappy_LIB Snappy_INCLUDE_DIR) > -- Building snappy from source > -- Checking for modules 'libbrotlicommon;libbrotlienc;libbrotlidec' > -- Package 'libbrotlicommon', required by 'virtual:world', not found > -- Package 'libbrotlienc', required by 'virtual:world', not found > -- Package 'libbrotlidec', required
[jira] [Commented] (ARROW-14947) [C++] Implement maths with timestamps
[ https://issues.apache.org/jira/browse/ARROW-14947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497296#comment-17497296 ] Dragoș Moldovan-Grünfeld commented on ARROW-14947: -- [~rokm] I haven't looked much into this. Do you think it's still a C++ component issue? > [C++] Implement maths with timestamps > - > > Key: ARROW-14947 > URL: https://issues.apache.org/jira/browse/ARROW-14947 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Nicola Crane >Priority: Major > > Please could we have maths with timestamps implemented? > In order to implement some of the functionality I'd like in R, I need to be > able to do maths with dates. For example: > * Addition and subtraction: Timestamp + Duration = Timestamp (with and > without rollback so have ability to do e.g. 2021-03-30 minus 1 month and > either get a null back, or 2021-02-28), plus the ability to specify whether > to rollback to the first or last, and whether to preserve or reset the time. > See https://lubridate.tidyverse.org/reference/mplus.html for documentation of > the R functionality. > * Multiplying Durations: Duration * Numeric = Duration -- This message was sent by Atlassian Jira (v8.20.1#820001)