[jira] [Resolved] (ARROW-6933) [Java] Support linear dictionary encoder
[ https://issues.apache.org/jira/browse/ARROW-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6933. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5692 [https://github.com/apache/arrow/pull/5692] > [Java] Support linear dictionary encoder > --- > > Key: ARROW-6933 > URL: https://issues.apache.org/jira/browse/ARROW-6933 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > In many scenarios, the distribution of dictionary entries is highly skewed. > In other words, a few dictionary entries occur much more frequently than > the others. If we sort the dictionary in non-increasing order of entry > frequency, and compare each value to encode against the entries starting from the beginning of the > dictionary, we get the following benefits: > 1) We need no extra memory space or data structure. > 2) The search is extremely efficient, as we are likely to find a match > in the first few entries of the dictionary. > This is the basic idea behind the linear dictionary encoder. When the > scenario is right (highly skewed dictionary distribution), it outperforms > both search-based and hash-table-based encoders. -- This message was sent by Atlassian Jira (v8.3.4#803005)
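The idea described in ARROW-6933 can be illustrated with a minimal sketch. This is not the code from pull request 5692: the class name `LinearDictionaryEncoderSketch`, the `encode` method, the use of plain `List<String>` instead of Arrow vectors, and the `-1` marker for missing values are all assumptions made for illustration. The dictionary is assumed to be already sorted in non-increasing order of entry frequency.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Arrow Java API.
final class LinearDictionaryEncoderSketch {

    /**
     * Encodes each value as its index in the dictionary by scanning the
     * dictionary from the beginning. Values not found are encoded as -1
     * (an assumption of this sketch).
     */
    static int[] encode(List<String> dictionary, List<String> values) {
        int[] indices = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            indices[i] = -1;
            // Linear scan: with a skewed distribution and a frequency-sorted
            // dictionary, a match is usually found within the first entries.
            for (int j = 0; j < dictionary.size(); j++) {
                if (dictionary.get(j).equals(values.get(i))) {
                    indices[i] = j;
                    break;
                }
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        // Most frequent entry first.
        List<String> dictionary = Arrays.asList("US", "DE", "FR", "NZ");
        List<String> values = Arrays.asList("US", "US", "DE", "US", "NZ");
        System.out.println(Arrays.toString(encode(dictionary, values)));
        // prints [0, 0, 1, 0, 3]
    }
}
```

No auxiliary structure such as a hash table is built; the cost per value is proportional to the position of its match in the dictionary, which is why the approach only pays off when the distribution is highly skewed.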
[jira] [Resolved] (ARROW-6866) [Java] Improve the performance of calculating hash code for struct vector
[ https://issues.apache.org/jira/browse/ARROW-6866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6866. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5633 [https://github.com/apache/arrow/pull/5633] > [Java] Improve the performance of calculating hash code for struct vector > - > > Key: ARROW-6866 > URL: https://issues.apache.org/jira/browse/ARROW-6866 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Improve the performance of the hashCode(int) method for StructVector: > 1. We can get the child vectors directly, so there is no need to get the name > from the child vector and then use the name to get the vector. > 2. The child vectors cannot be null, so there is no need to check for null. > The performance improvement depends on the complexity of the hash algorithm. > For computationally intensive hash algorithms, the improvement can be small, > while for simple hash algorithms, the improvement can be notable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6672) [Java] Extract a common interface for dictionary builders
[ https://issues.apache.org/jira/browse/ARROW-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6672. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5486 [https://github.com/apache/arrow/pull/5486] > [Java] Extract a common interface for dictionary builders > - > > Key: ARROW-6672 > URL: https://issues.apache.org/jira/browse/ARROW-6672 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > We need a common interface for dictionary builders to support more > sophisticated scenarios, like collecting dictionary statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6394) [Java] Support conversions between delta vector and partial sum vector
[ https://issues.apache.org/jira/browse/ARROW-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6394. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5235 [https://github.com/apache/arrow/pull/5235] > [Java] Support conversions between delta vector and partial sum vector > -- > > Key: ARROW-6394 > URL: https://issues.apache.org/jira/browse/ARROW-6394 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > What is a delta vector/partial sum vector? > Given an integer vector a with length n, its partial sum vector is another > integer vector b with length n + 1, with values defined as: > b(0) = initial sum > b(i) = a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n > Given an integer vector a with length n + 1, its delta vector is another > integer vector b with length n, with values defined as: > b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1 > In this issue, we provide utilities to convert between delta vector and partial sum > vector. It is interesting to note that the two operations correspond to > discrete integration and differentiation. > These conversions have wide applications. For example, > 1. The run-length vector proposed by Micah is based on the partial sum > vector, while the deduplication functionality is based on delta vector. This > issue provides conversions between them. > 2. The current VarCharVector/VarBinaryVector implementations are based on > partial sum vector. We can transform them to delta vectors before IPC, to > reduce network traffic. > 3. Converting to delta can be considered as a way for data compression. To > further reduce the data volume, the operation can be applied more than once. -- This message was sent by Atlassian Jira (v8.3.4#803005)
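The two conversions described in ARROW-6394 can be sketched on plain arrays. This is an illustration under stated assumptions, not the utility added by pull request 5235: the class `DeltaPartialSumSketch` and its method names are hypothetical, `long[]` stands in for an Arrow integer vector, and the initial sum is assumed to be carried into every later partial sum, so that taking deltas exactly undoes the summation.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the Arrow Java API.
final class DeltaPartialSumSketch {

    /** Length-n deltas plus an initial sum -> length n + 1 partial sums. */
    static long[] toPartialSum(long[] deltas, long initialSum) {
        long[] sums = new long[deltas.length + 1];
        sums[0] = initialSum;
        for (int i = 0; i < deltas.length; i++) {
            // Discrete integration: each entry adds one delta to the previous sum.
            sums[i + 1] = sums[i] + deltas[i];
        }
        return sums;
    }

    /** Length n + 1 partial sums -> length-n deltas. */
    static long[] toDelta(long[] sums) {
        if (sums.length == 0) {
            throw new IllegalArgumentException("a partial sum vector has at least one entry");
        }
        long[] deltas = new long[sums.length - 1];
        for (int i = 0; i < deltas.length; i++) {
            // Discrete differentiation: difference of adjacent entries.
            deltas[i] = sums[i + 1] - sums[i];
        }
        return deltas;
    }

    public static void main(String[] args) {
        // Value lengths 3, 0, 5 correspond to offsets 0, 3, 3, 8,
        // similar in spirit to a variable-width offset buffer.
        long[] lengths = {3, 0, 5};
        long[] offsets = toPartialSum(lengths, 0);
        System.out.println(Arrays.toString(offsets));          // prints [0, 3, 3, 8]
        System.out.println(Arrays.toString(toDelta(offsets))); // prints [3, 0, 5]
    }
}
```

In this sketch the round trip `toDelta(toPartialSum(d, s))` returns `d` for any initial sum `s`, which mirrors the integration/differentiation remark in the issue.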
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958490#comment-16958490 ] Neal Richardson commented on ARROW-6977: I made ARROW-6983 for the threading issue. ARROW-6979 is for the jemalloc for macOS R packages; do I remember correctly that jemalloc isn't available for Windows? As for this issue, assuming it's innocuous, it's just annoying that every time I load the package I get this message. I don't know how to suppress it though. It would be nice if we could set background_thread: true in our code based on the same condition that jemalloc is checking. > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6983) [C++] Threaded task group crashes sometimes
Neal Richardson created ARROW-6983: -- Summary: [C++] Threaded task group crashes sometimes Key: ARROW-6983 URL: https://issues.apache.org/jira/browse/ARROW-6983 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Neal Richardson Assignee: Antoine Pitrou Fix For: 0.15.1 You can give this a more descriptive title :) See discussion on ARROW-6977. https://gist.github.com/pitrou/87f3091c226db3306c45b2c32dd9aea8 seems to fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958435#comment-16958435 ] Wes McKinney commented on ARROW-6977: - The thread pool thing seems to be a new issue, can we open a JIRA for that? What do you want to do about the background thread issue? We can set decay_ms=0 by default with no background thread, but I'm not sure what the performance impact will be. Creating packages without jemalloc is not a good idea because of the performance implications. > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6966) [Go] 32bit memset is null
[ https://issues.apache.org/jira/browse/ARROW-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6966: Fix Version/s: 1.0.0 > [Go] 32bit memset is null > - > > Key: ARROW-6966 > URL: https://issues.apache.org/jira/browse/ARROW-6966 > Project: Apache Arrow > Issue Type: Bug >Reporter: Jonathan A Sternberg >Assignee: Jonathan A Sternberg >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > If you use a function that calls `memset.Set`, the implementation on a 32 bit > machine seems to be unset. This happened in our 32 bit build here: > [https://circleci.com/gh/influxdata/influxdb/66112#tests/containers/2] > {code:java} > goroutine 66 [running]:goroutine 66 > [running]:testing.tRunner.func1(0x9e1f2c0) > /usr/local/go/src/testing/testing.go:830 +0x30epanic(0x899cb40, 0x9403c40) > /usr/local/go/src/runtime/panic.go:522 > +0x16egithub.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/memory.Set(...) 
> > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/memory/memory.go:25github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*builder).init(0x9e44990, > 0x20) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/builder.go:101 > > +0xc7github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).init(0x9e44990, > 0x20) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:102 > > +0x2fgithub.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).Resize(0x9e44990, > 0x2) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:125 > > +0x42github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*builder).reserve(0x9e44990, > 0x1, 0x9c52464) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/builder.go:138 > > +0x72github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).Reserve(0x9e44990, > 0x1) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:113 > > +0x51github.com/influxdata/influxdb/vendor/github.com/influxdata/flux/arrow.NewInt(0x9e4a770, > 0x1, 0x1, 0x0, 0x89f0360) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/influxdata/flux/arrow/int.go:10 > > +0x6cgithub.com/influxdata/influxdb/storage/reads.(*floatTable).advance(0x9e42070, > 0x0) > /root/go/src/github.com/influxdata/influxdb/storage/reads/table.gen.go:91 > +0x7egithub.com/influxdata/influxdb/storage/reads.newFloatTable(0x9e17740, > 0xe521a160, 0x9e1b8c0, 0x0, 0x0, 0x1e, 0x0, 0x8c13be0, 0x9e448a0, 0x9e448d0, > ...) 
> /root/go/src/github.com/influxdata/influxdb/storage/reads/table.gen.go:47 > +0x1c2github.com/influxdata/influxdb/storage/reads.(*filterIterator).handleRead(0x9e22840, > 0x9e0d1a0, 0x8c0ce00, 0x9e48780, 0x0, 0x0) > /root/go/src/github.com/influxdata/influxdb/storage/reads/reader.go:177 > +0x755github.com/influxdata/influxdb/storage/reads.(*filterIterator).Do(0x9e22840, > 0x9e0d170, 0x9c40070, 0x0) > /root/go/src/github.com/influxdata/influxdb/storage/reads/reader.go:140 > +0x138github.com/influxdata/influxdb/storage/reads_test.TestDuplicateKeys_ReadFilter(0x9e1f2c0) > /root/go/src/github.com/influxdata/influxdb/storage/reads/reader_test.go:89 > +0x1dftesting.tRunner(0x9e1f2c0, 0x8ad44e4) > /usr/local/go/src/testing/testing.go:865 +0x97created by testing.(*T).Run > /usr/local/go/src/testing/testing.go:916 +0x2b2 > {code} > I added a print statement at where memset happened to print the function that > was being used and got this: > {code} > [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] 0 > {code} > If I set {{memset}} with a default, the code that calls into this works fine. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6966) [Go] 32bit memset is null
[ https://issues.apache.org/jira/browse/ARROW-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6966: --- Assignee: Jonathan A Sternberg > [Go] 32bit memset is null > - > > Key: ARROW-6966 > URL: https://issues.apache.org/jira/browse/ARROW-6966 > Project: Apache Arrow > Issue Type: Bug >Reporter: Jonathan A Sternberg >Assignee: Jonathan A Sternberg >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > If you use a function that calls `memset.Set`, the implementation on a 32 bit > machine seems to be unset. This happened in our 32 bit build here: > [https://circleci.com/gh/influxdata/influxdb/66112#tests/containers/2] > {code:java} > goroutine 66 [running]:goroutine 66 > [running]:testing.tRunner.func1(0x9e1f2c0) > /usr/local/go/src/testing/testing.go:830 +0x30epanic(0x899cb40, 0x9403c40) > /usr/local/go/src/runtime/panic.go:522 > +0x16egithub.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/memory.Set(...) 
> > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/memory/memory.go:25github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*builder).init(0x9e44990, > 0x20) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/builder.go:101 > > +0xc7github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).init(0x9e44990, > 0x20) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:102 > > +0x2fgithub.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).Resize(0x9e44990, > 0x2) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:125 > > +0x42github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*builder).reserve(0x9e44990, > 0x1, 0x9c52464) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/builder.go:138 > > +0x72github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).Reserve(0x9e44990, > 0x1) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:113 > > +0x51github.com/influxdata/influxdb/vendor/github.com/influxdata/flux/arrow.NewInt(0x9e4a770, > 0x1, 0x1, 0x0, 0x89f0360) > /root/go/src/github.com/influxdata/influxdb/vendor/github.com/influxdata/flux/arrow/int.go:10 > > +0x6cgithub.com/influxdata/influxdb/storage/reads.(*floatTable).advance(0x9e42070, > 0x0) > /root/go/src/github.com/influxdata/influxdb/storage/reads/table.gen.go:91 > +0x7egithub.com/influxdata/influxdb/storage/reads.newFloatTable(0x9e17740, > 0xe521a160, 0x9e1b8c0, 0x0, 0x0, 0x1e, 0x0, 0x8c13be0, 0x9e448a0, 0x9e448d0, > ...) 
> /root/go/src/github.com/influxdata/influxdb/storage/reads/table.gen.go:47 > +0x1c2github.com/influxdata/influxdb/storage/reads.(*filterIterator).handleRead(0x9e22840, > 0x9e0d1a0, 0x8c0ce00, 0x9e48780, 0x0, 0x0) > /root/go/src/github.com/influxdata/influxdb/storage/reads/reader.go:177 > +0x755github.com/influxdata/influxdb/storage/reads.(*filterIterator).Do(0x9e22840, > 0x9e0d170, 0x9c40070, 0x0) > /root/go/src/github.com/influxdata/influxdb/storage/reads/reader.go:140 > +0x138github.com/influxdata/influxdb/storage/reads_test.TestDuplicateKeys_ReadFilter(0x9e1f2c0) > /root/go/src/github.com/influxdata/influxdb/storage/reads/reader_test.go:89 > +0x1dftesting.tRunner(0x9e1f2c0, 0x8ad44e4) > /usr/local/go/src/testing/testing.go:865 +0x97created by testing.(*T).Run > /usr/local/go/src/testing/testing.go:916 +0x2b2 > {code} > I added a print statement at where memset happened to print the function that > was being used and got this: > {code} > [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] 0 > {code} > If I set {{memset}} with a default, the code that calls into this works fine. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958403#comment-16958403 ] Neal Richardson commented on ARROW-6977: I applied the patch and have run the R test suite 10 times in a row, all good. > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958328#comment-16958328 ] Neal Richardson commented on ARROW-6977: Just a reminder that this is on master but I've rebuilt C++ with jemalloc off, so I'm no longer seeing that warning message. So this may be unrelated to the initial report. Feel free to move this to a different issue if you see fit. > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958329#comment-16958329 ] Antoine Pitrou commented on ARROW-6977: --- Can you give this patch a try? Run with it a number of times. https://gist.github.com/pitrou/87f3091c226db3306c45b2c32dd9aea8 > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958327#comment-16958327 ] Neal Richardson commented on ARROW-6977: Re-running a bunch now, I've crashed here (to the best of my estimation): * https://github.com/apache/arrow/blob/master/r/tests/testthat/test-csv.R#L43 * https://github.com/apache/arrow/blob/master/r/tests/testthat/test-csv.R#L44 * https://github.com/apache/arrow/blob/master/r/tests/testthat/test-csv.R#L45 (3x) * https://github.com/apache/arrow/blob/master/r/tests/testthat/test-csv.R#L58-L59 * https://github.com/apache/arrow/blob/master/r/tests/testthat/test-Table.R#L133-L135 * https://github.com/apache/arrow/blob/master/r/tests/testthat/test-Table.R#L140 * https://github.com/apache/arrow/blob/master/r/tests/testthat/test-Table.R#L189-L192 * https://github.com/apache/arrow/blob/master/r/tests/testthat/test-Table.R#L196-L205 The common thread (excuse the pun) I *think* is that they take a Table and bring it into R as a data.frame. This function: https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L811-L828 About 2 out of 3 times the test suite completes cleanly. > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. 
Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs
[ https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958321#comment-16958321 ] Kyle McCarthy edited comment on ARROW-6947 at 10/23/19 10:59 PM: - I am curious to see if you have any ideas about how this would work. I have been working on a PoC, but will probably need to make some design decisions and would like to see if they align with yours. At a high level, I see this working by composing a UDF with some general ScalarFunction type. Right now I have the ScalarFunction with type: {code:java} Box) -> Result{code} so if a user defines a function such as {code:java} fn length(s: String) -> usize{code} we would wrap that and return our ScalarFunction. I think that the composed functions need to be associated with some "static" metadata, similar to the FunctionMeta in the logical plan. I think we would want to know the DataType of the arguments that the function expects and if they are optional, as well as the return type and if it is fallible/infallible. If the UDF accepts and returns primitive Rust types, generating that metadata should be pretty straightforward. However, if the UDF takes/returns ScalarValues, the user would have to specifically provide the metadata. We would be able to generate most of the data for the logical plan's FunctionMeta but would still need the function name and the field names for the args. As of right now, I haven't done anything related to Aggregate UDFs or actually registering them with the ExecutionContext. was (Author: kylemccarthy): I am curious to see if you have any ideas about how this would work. I have been working on a PoC, but will probably need to make some design decisions and would like to see if they align with yours. At a high level, I see this working by composing a UDF with some general ScalarFunction type. 
Right now I have the ScalarFunction with type: ```Box) -> Result``` so if a users defines a function such as ```fn length(s: String) -> usize``` we would wrap that and return our ScalarFunction. I think that the composed functions need to be associated with some "static" metadata, similar to the FunctionMeta in the logical plan. I think we would want to know the DataType of the arguments that the function expects and if they are optional, as well as the return type and if it is fallible/infallible. If the UDF accepts and returns primitive rust types, generating that meta data should be pretty straight forward. However, if the UDF takes/returns ScalarValues the user would have to specifically provide the metadata. We would be able to generate most of the data for the logical plan's FunctionMeta but would still need the function name and the field names for the args. As of right now, I haven't done anything related to Aggregate UDFs or actually registering them with the ExecutionContext. > [Rust] [DataFusion] Add support for scalar UDFs > --- > > Key: ARROW-6947 > URL: https://issues.apache.org/jira/browse/ARROW-6947 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > As a user, I would like to be able to define my own functions and then use > them in SQL statements. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs
[ https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958321#comment-16958321 ] Kyle McCarthy edited comment on ARROW-6947 at 10/23/19 10:59 PM: - I am curious to see if you have any ideas about how this would work. I have been working on a PoC, but will probably need to make some design decisions and would like to see if they align with yours. At a high level, I see this working by composing a UDF with some general ScalarFunction type. Right now I have the ScalarFunction with type: {code:java} Box) -> Result{code} so if a user defines a function such as {code:java} fn length(s: String) -> usize{code} we would wrap that and return our ScalarFunction. I think that the composed functions need to be associated with some "static" metadata, similar to the FunctionMeta in the logical plan. I think we would want to know the DataType of the arguments that the function expects and if they are optional, as well as the return type and if it is fallible/infallible. If the UDF accepts and returns primitive Rust types, generating that metadata should be pretty straightforward. However, if the UDF takes/returns ScalarValues, the user would have to specifically provide the metadata. We would be able to generate most of the data for the logical plan's FunctionMeta but would still need the function name and the field names for the args. As of right now, I haven't done anything related to Aggregate UDFs or actually registering them with the ExecutionContext. was (Author: kylemccarthy): I am curious to see if you have any ideas about how this would work. I have been working on a PoC, but will probably need to make some design decisions and would like to see if they align with yours. At a high level, I see this working by composing a UDF with some general ScalarFunction type. 
Right now I have the ScalarFunction with type: {code:java} Box) -> Result{code} so if a users defines a function such as {code:java} fn length(s: String) -> usize{code} we would wrap that and return our ScalarFunction. I think that the composed functions need to be associated with some "static" metadata, similar to the FunctionMeta in the logical plan. I think we would want to know the DataType of the arguments that the function expects and if they are optional, as well as the return type and if it is fallible/infallible. If the UDF accepts and returns primitive rust types, generating that meta data should be pretty straight forward. However, if the UDF takes/returns ScalarValues the user would have to specifically provide the metadata. We would be able to generate most of the data for the logical plan's FunctionMeta but would still need the function name and the field names for the args. As of right now, I haven't done anything related to Aggregate UDFs or actually registering them with the ExecutionContext. > [Rust] [DataFusion] Add support for scalar UDFs > --- > > Key: ARROW-6947 > URL: https://issues.apache.org/jira/browse/ARROW-6947 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > As a user, I would like to be able to define my own functions and then use > them in SQL statements. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
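The wrapping idea described in the comment above could look roughly like the sketch below. It is only an illustration under stated assumptions: `ScalarValue`, `DataType`, `ArgMeta`, `UdfMeta`, `ScalarUdf`, and `wrap_string_to_usize` are hypothetical stand-ins defined here, not DataFusion's actual API, and the `ScalarFunction` signature is a guess because the type quoted in the comment is truncated. The function name and argument name are passed in by hand, since the comment notes they cannot be derived from the Rust signature.

```rust
// Hypothetical stand-ins for illustration only; NOT DataFusion's real types.
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Utf8(String),
    UInt64(u64),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    UInt64,
}

/// Static description of a single argument (hypothetical).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct ArgMeta {
    name: String,
    data_type: DataType,
    optional: bool,
}

/// Static metadata attached to a wrapped UDF (hypothetical).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct UdfMeta {
    name: String,
    args: Vec<ArgMeta>,
    return_type: DataType,
    fallible: bool,
}

/// Assumed shape of the general function type; the original signature is truncated.
type ScalarFunction = Box<dyn Fn(&[ScalarValue]) -> Result<ScalarValue, String>>;

struct ScalarUdf {
    meta: UdfMeta,
    func: ScalarFunction,
}

/// The user-defined function used as the example in the comment.
fn length(s: String) -> usize {
    s.len()
}

/// Wraps a `fn(String) -> usize` into a `ScalarUdf`. The argument and return
/// types come from the Rust signature; the names must be supplied by the caller.
fn wrap_string_to_usize(name: &str, arg_name: &str, f: fn(String) -> usize) -> ScalarUdf {
    let func: ScalarFunction = Box::new(move |args: &[ScalarValue]| match args {
        [ScalarValue::Utf8(s)] => Ok(ScalarValue::UInt64(f(s.clone()) as u64)),
        [other] => Err(format!("expected a Utf8 argument, got {:?}", other)),
        _ => Err(format!("expected 1 argument, got {}", args.len())),
    });
    ScalarUdf {
        meta: UdfMeta {
            name: name.to_string(),
            args: vec![ArgMeta {
                name: arg_name.to_string(),
                data_type: DataType::Utf8,
                optional: false,
            }],
            return_type: DataType::UInt64,
            // The user's function cannot fail; only the wrapper's type check can.
            fallible: false,
        },
        func,
    }
}

fn main() {
    let udf = wrap_string_to_usize("length", "s", length);
    let result = (udf.func)(&[ScalarValue::Utf8("arrow".to_string())]);
    println!("{:?} -> {:?}", udf.meta, result);
}
```

A UDF that takes or returns `ScalarValue` directly would skip the conversion step, but then the caller would have to supply the whole `UdfMeta` by hand, which matches the trade-off raised in the comment.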
[jira] [Comment Edited] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs
[ https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958321#comment-16958321 ] Kyle McCarthy edited comment on ARROW-6947 at 10/23/19 10:58 PM: - I am curious to see if you have any ideas about how this would work. I have been working on a PoC, but will probably need to make some design decisions and would like to see if they align with yours. At a high level, I see this working by composing a UDF with some general ScalarFunction type. Right now I have the ScalarFunction with type: ```Box) -> Result``` so if a user defines a function such as ```fn length(s: String) -> usize``` we would wrap that and return our ScalarFunction. I think that the composed functions need to be associated with some "static" metadata, similar to the FunctionMeta in the logical plan. I think we would want to know the DataType of the arguments that the function expects and whether they are optional, as well as the return type and whether it is fallible or infallible. If the UDF accepts and returns primitive Rust types, generating that metadata should be pretty straightforward. However, if the UDF takes/returns ScalarValues, the user would have to specifically provide the metadata. We would be able to generate most of the data for the logical plan's FunctionMeta but would still need the function name and the field names for the args. As of right now, I haven't done anything related to Aggregate UDFs or actually registering them with the ExecutionContext. was (Author: kylemccarthy): I am curious to see if you have any ideas about how this would work. I have been working on a PoC, but will probably need to make some design decisions and would like to see if they align with yours. At a high level, I see this working by composing a UDF with some general ScalarFunction type. 
Right now I have the ScalarFunction with type ```Box) -> Result```, so if a user defines a function such as ```fn length(s: String) -> usize``` we would wrap that and return our ScalarFunction. I think that the composed functions need to be associated with some "static" metadata, similar to the FunctionMeta in the logical plan. I think we would want to know the DataType of the arguments that the function expects and whether they are optional, as well as the return type and whether it is fallible or infallible. If the UDF accepts and returns primitive Rust types, generating that metadata should be pretty straightforward. However, if the UDF takes/returns ScalarValues, the user would have to specifically provide the metadata. We would be able to generate most of the data for the logical plan's FunctionMeta but would still need the function name and the field names for the args. As of right now, I haven't done anything related to Aggregate UDFs or actually registering them with the ExecutionContext. > [Rust] [DataFusion] Add support for scalar UDFs > --- > > Key: ARROW-6947 > URL: https://issues.apache.org/jira/browse/ARROW-6947 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > As a user, I would like to be able to define my own functions and then use > them in SQL statements. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs
[ https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958321#comment-16958321 ] Kyle McCarthy edited comment on ARROW-6947 at 10/23/19 10:58 PM: - I am curious to see if you have any ideas about how this would work. I have been working on a PoC, but will probably need to make some design decisions and would like to see if they align with yours. At a high level, I see this working by composing a UDF with some general ScalarFunction type. Right now I have the ScalarFunction with type ```Box) -> Result```, so if a user defines a function such as ```fn length(s: String) -> usize``` we would wrap that and return our ScalarFunction. I think that the composed functions need to be associated with some "static" metadata, similar to the FunctionMeta in the logical plan. I think we would want to know the DataType of the arguments that the function expects and whether they are optional, as well as the return type and whether it is fallible or infallible. If the UDF accepts and returns primitive Rust types, generating that metadata should be pretty straightforward. However, if the UDF takes/returns ScalarValues, the user would have to specifically provide the metadata. We would be able to generate most of the data for the logical plan's FunctionMeta but would still need the function name and the field names for the args. As of right now, I haven't done anything related to Aggregate UDFs or actually registering them with the ExecutionContext. was (Author: kylemccarthy): I am curious to see if you have any ideas about how this would work. I have been working on a PoC, but will probably need to make some design decisions and would like to see if they align with yours. At a high level, I see this working by composing a UDF with some general ScalarFunction type. 
Right now I have the ScalarFunction with type `Box) -> Result`, so if a user defines a function such as `fn length(s: String) -> usize` we would wrap that and return our ScalarFunction. I think that the composed functions need to be associated with some "static" metadata, similar to the FunctionMeta in the logical plan. I think we would want to know the DataType of the arguments that the function expects and whether they are optional, as well as the return type and whether it is fallible or infallible. If the UDF accepts and returns primitive Rust types, generating that metadata should be pretty straightforward. However, if the UDF takes/returns ScalarValues, the user would have to specifically provide the metadata. We would be able to generate most of the data for the logical plan's FunctionMeta but would still need the function name and the field names for the args. As of right now, I haven't done anything related to Aggregate UDFs or actually registering them with the ExecutionContext. > [Rust] [DataFusion] Add support for scalar UDFs > --- > > Key: ARROW-6947 > URL: https://issues.apache.org/jira/browse/ARROW-6947 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > As a user, I would like to be able to define my own functions and then use > them in SQL statements. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs
[ https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958321#comment-16958321 ] Kyle McCarthy commented on ARROW-6947: -- I am curious to see if you have any ideas about how this would work. I have been working on a PoC, but will probably need to make some design decisions and would like to see if they align with yours. At a high level, I see this working by composing a UDF with some general ScalarFunction type. Right now I have the ScalarFunction with type `Box) -> Result`, so if a user defines a function such as `fn length(s: String) -> usize` we would wrap that and return our ScalarFunction. I think that the composed functions need to be associated with some "static" metadata, similar to the FunctionMeta in the logical plan. I think we would want to know the DataType of the arguments that the function expects and whether they are optional, as well as the return type and whether it is fallible or infallible. If the UDF accepts and returns primitive Rust types, generating that metadata should be pretty straightforward. However, if the UDF takes/returns ScalarValues, the user would have to specifically provide the metadata. We would be able to generate most of the data for the logical plan's FunctionMeta but would still need the function name and the field names for the args. As of right now, I haven't done anything related to Aggregate UDFs or actually registering them with the ExecutionContext. > [Rust] [DataFusion] Add support for scalar UDFs > --- > > Key: ARROW-6947 > URL: https://issues.apache.org/jira/browse/ARROW-6947 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > As a user, I would like to be able to define my own functions and then use > them in SQL statements. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958318#comment-16958318 ] Neal Richardson commented on ARROW-6977: On this particular crash (after 14 assertions passed), it looks like https://github.com/apache/arrow/blob/master/r/tests/testthat/test-csv.R#L119 was the line that crashed. But like I said, it sometimes doesn't fail there, sometimes fails earlier, and sometimes fails in a different test file. > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958316#comment-16958316 ] Neal Richardson commented on ARROW-6977: It doesn't always fail in the same place. Sometimes I see three test assertions pass and then it fails, which means that it errors before the next assertion. Which would put it somewhere around https://github.com/apache/arrow/blob/master/r/tests/testthat/test-csv.R#L43-L45. But other times it doesn't fail on that block and fails later. I'll try rewriting the tests to disambiguate and see if there's a pattern of where exactly it fails. > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958313#comment-16958313 ] Antoine Pitrou commented on ARROW-6977: --- Yeah, but what does the test it fails in precisely do? > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6925) [C++] Arrow fails to buld on MacOS 10.13.6 using brew gcc 7 and 8
[ https://issues.apache.org/jira/browse/ARROW-6925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958312#comment-16958312 ] Wes McKinney commented on ARROW-6925: - [~fsaintjacques] John needed to be added to the "Contributor" role on JIRA -- done now. You are already an admin in JIRA for Apache Arrow, so you should get familiar with how to do this from the JIRA administration page, starting from the top right. > [C++] Arrow fails to buld on MacOS 10.13.6 using brew gcc 7 and 8 > - > > Key: ARROW-6925 > URL: https://issues.apache.org/jira/browse/ARROW-6925 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: MacOS 10.13.6 using both brew gcc 7 and 8. >Reporter: John Norris >Assignee: John Norris >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Both SetupCxxFlags.cmake and ThirdpartyToolchain.cmake add -stdlib=libc++ to > the compiler flags when APPLE is true, but if you're using GCC from brew (or > presumably from anywhere other than Apple), this flag is not recognized and > the build fails. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6925) [C++] Arrow fails to buld on MacOS 10.13.6 using brew gcc 7 and 8
[ https://issues.apache.org/jira/browse/ARROW-6925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6925: --- Assignee: John Norris > [C++] Arrow fails to buld on MacOS 10.13.6 using brew gcc 7 and 8 > - > > Key: ARROW-6925 > URL: https://issues.apache.org/jira/browse/ARROW-6925 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: MacOS 10.13.6 using both brew gcc 7 and 8. >Reporter: John Norris >Assignee: John Norris >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Both SetupCxxFlags.cmake and ThirdpartyToolchain.cmake add -stdlib=libc++ to > the compiler flags when APPLE is true, but if you're using GCC from brew (or > presumably from anywhere other than Apple), this flag is not recognized and > the build fails. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958309#comment-16958309 ] Neal Richardson commented on ARROW-6977: {code} (lldb) bt all libR.dylib was compiled with optimization - stepping may behave oddly; variables may not be available. thread #1, queue = 'com.apple.main-thread' frame #0: 0x000100170660 libR.dylib`R_HashGet(hashcode=117, symbol=0x0001010689f8, table=) at envir.c:0 [opt] frame #1: 0x000100171246 libR.dylib`Rf_findFun3(symbol=0x0001010689f8, rho=0x00010583a680, call=) at envir.c:1521:11 [opt] frame #2: 0x0001001875d3 libR.dylib`bcEval(body=, rho=0x00011284c468, useCache=) at eval.c:6560:15 [opt] frame #3: 0x000100182aed libR.dylib`Rf_eval(e=, rho=) at eval.c:620:8 [opt] frame #4: 0x0001001a13a9 libR.dylib`R_execClosure(call=0x000105f48098, newrho=, sysparent=, rho=0x000112852f20, arglist=, op=) at eval.c:0:19 [opt] frame #5: 0x0001001a02aa libR.dylib`Rf_applyClosure(call=0x000105f48098, op=0x00010995f8b0, arglist=0x00011284c190, rho=0x000112852f20, suppliedvars=0x0001018058e0) at eval.c:1706:16 [opt] frame #6: 0x000100189f11 libR.dylib`bcEval(body=, rho=0x000112852f20, useCache=) at eval.c:6733:12 [opt] frame #7: 0x000100182aed libR.dylib`Rf_eval(e=, rho=) at eval.c:620:8 [opt] frame #8: 0x0001001a13a9 libR.dylib`R_execClosure(call=0x0001193dcc80, newrho=, sysparent=, rho=0x0001193eb620, arglist=, op=) at eval.c:0:19 [opt] frame #9: 0x0001001a02aa libR.dylib`Rf_applyClosure(call=0x0001193dcc80, op=0x000105f34f28, arglist=0x000112852d28, rho=0x0001193eb620, suppliedvars=0x0001018058e0) at eval.c:1706:16 [opt] frame #10: 0x00010018301d libR.dylib`Rf_eval(e=0x0001193dcc80, rho=0x0001193eb620) at eval.c:743:12 [opt] frame #11: 0x0001001a3a20 libR.dylib`do_begin(call=0x0001193da158, op=0x00010180f000, args=0x0001193dcba0, rho=) at eval.c:2382:10 [opt] frame #12: 0x000100182ce0 libR.dylib`Rf_eval(e=, rho=0x0001193eb620) at eval.c:695:12 [opt] frame #13: 0x00010019fa63 
libR.dylib`forcePromise(e=0x0001129e8938) at eval.c:516:8 [opt] frame #14: 0x000100182dd0 libR.dylib`Rf_eval(e=, rho=) at eval.c:643:9 [opt] frame #15: 0x0001001a3a20 libR.dylib`do_begin(call=0x000109947450, op=0x00010180f000, args=0x0001099476b8, rho=) at eval.c:2382:10 [opt] frame #16: 0x000100182ce0 libR.dylib`Rf_eval(e=, rho=0x0001129dae78) at eval.c:695:12 [opt] frame #17: 0x0001001a4d66 libR.dylib`do_eval(call=, op=0x0001018260b0, args=, rho=) at eval.c:3186:13 [opt] frame #18: 0x00010018a326 libR.dylib`bcEval(body=, rho=0x0001129d60e8, useCache=) at eval.c:6765:14 [opt] frame #19: 0x000100182aed libR.dylib`Rf_eval(e=, rho=) at eval.c:620:8 [opt] frame #20: 0x0001001a13a9 libR.dylib`R_execClosure(call=0x00010990aa40, newrho=, sysparent=, rho=0x0001129e6f58, arglist=, op=) at eval.c:0:19 [opt] frame #21: 0x0001001a02aa libR.dylib`Rf_applyClosure(call=0x00010990aa40, op=0x0001028046d8, arglist=0x0001129d9ea8, rho=0x0001129e6f58, suppliedvars=0x0001018058e0) at eval.c:1706:16 [opt] frame #22: 0x000100189f11 libR.dylib`bcEval(body=, rho=0x0001129e6f58, useCache=) at eval.c:6733:12 [opt] frame #23: 0x000100182aed libR.dylib`Rf_eval(e=, rho=) at eval.c:620:8 [opt] frame #24: 0x00010019fa63 libR.dylib`forcePromise(e=0x0001129d9310) at eval.c:516:8 [opt] frame #25: 0x0001001aa2ec libR.dylib`getvar [inlined] FORCE_PROMISE(value=, symbol=, rho=, keepmiss=) at eval.c:4897:15 [opt] frame #26: 0x0001001aa2e4 libR.dylib`getvar(symbol=0x000101851fc8, rho=0x0001129d97e0, dd=, keepmiss=, vcache=, sidx=, stack_base=0x000100b04ff0) at eval.c:4970 [opt] frame #27: 0x000100187094 libR.dylib`bcEval(body=, rho=0x0001129d97e0, useCache=) at eval.c:6517:20 [opt] frame #28: 0x000100182aed libR.dylib`Rf_eval(e=, rho=) at eval.c:620:8 [opt] frame #29: 0x0001001a13a9 libR.dylib`R_execClosure(call=0x00010990ab20, newrho=, sysparent=, rho=0x0001129e6f58, arglist=, op=) at eval.c:0:19 [opt] frame #30: 0x0001001a02aa libR.dylib`Rf_applyClosure(call=0x00010990ab20, op=0x0001017539b8, 
arglist=0x0001129d9348, rho=0x0001129e6f58, suppliedvars=0x0001018058e0) at eval.c:1706:16 [opt] frame #31: 0x000100189f11 libR.dylib`bcEval(body=, rho=0x0001129e6f58, useCache=) at eval.c:6733:12 [opt] frame #32: 0x000100182aed libR.dylib`Rf_eval(e=, rho=) at eval.c:620:8 [opt] frame #33: 0x00010019fa63 libR.dylib`forcePromise(e=0x0001129daee8) at eval.c:516:8 [opt] frame #34: 0x0001001aa2ec
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958305#comment-16958305 ] Neal Richardson commented on ARROW-6977: It most often fails in the CSV reader, which itself has multithreading (recently revised?), and when the data is pulled from Arrow into an R data.frame, it also uses threads. > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958304#comment-16958304 ] Antoine Pitrou commented on ARROW-6977: --- Also, it would be nice if you could give a bit of context. (What is the test doing? Can you run them in verbose mode to see where it's crashing?) > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-6968) [Python] 0.14.1 to 0.15.0 upgrade produces AttributeError
[ https://issues.apache.org/jira/browse/ARROW-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-6968. --- Resolution: Won't Fix
> [Python] 0.14.1 to 0.15.0 upgrade produces AttributeError
> -
>
> Key: ARROW-6968
> URL: https://issues.apache.org/jira/browse/ARROW-6968
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.0
> Environment: Python 3.7.4 on macOS Mojave 10.14.6
> Python 3.6.7 on Ubuntu 16.04.6 LTS
> Reporter: Michael Wheeler
> Priority: Major
> Attachments: attribute_error_pyarrow_0_15_0.py
>
>
> The code in question:
> {code:java}
> """
> Reproduce AttributeError with PyArrow == 0.15.0
> """
> import io
> import logging
> import pandas
> import pyarrow
> import sys
> import textwrap
> logging.basicConfig(level=logging.DEBUG)
> logging.debug(f'Python v{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}')
> logging.debug(f'PyArrow v{pyarrow.__version__}' + '\n')
> CSV_TEXT = textwrap.dedent("""\
> id,gender,some_date,age
> 001,M,01/01/2019,75
> 002,F,02/02/2018,32
> 003,M,03/03/2017,27
> 004,F,04/04/2016,19
> 005,M,05/05/2015,55
> 006,F,06/06/2014,42
> """)
> # Initialize pyarrow table via pandas
> mock_file = io.StringIO(CSV_TEXT)
> df = pandas.read_csv(mock_file).sort_values(['age', 'gender'])
> table = pyarrow.Table.from_pandas(df=df)
> # This comprehension generates a map between the name of the column and its index
> map_col_names_to_incides = {item.name: table.columns.index(item) for item in table.columns}
> logging.debug('The column indices are:')
> for name, index in map_col_names_to_incides.items():
>     logging.debug(f'Col {name} -> #{index}')
> {code}
>
> Expected result (generated with 0.14.1):
> {code:java}
> DEBUG:root:Python v3.7.4
> DEBUG:root:PyArrow v0.14.1
> DEBUG:root:The column indices are:
> DEBUG:root:Col id -> #0
> DEBUG:root:Col gender -> #1
> DEBUG:root:Col some_date -> #2
> DEBUG:root:Col age -> #3
> DEBUG:root:Col __index_level_0__ -> #4
> {code}
> Actual result (generated with 0.15.0):
> {code:java}
> DEBUG:root:Python v3.7.4
> DEBUG:root:PyArrow v0.15.0
> Traceback (most recent call last):
>   File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1758, in
>     main()
>   File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1752, in main
>     globals = debugger.run(setup['file'], None, None, is_module)
>   File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1147, in run
>     pydev_imports.execfile(file, globals, locals) # execute the script
>   File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
>     exec(compile(contents+"\n", file, 'exec'), glob, loc)
>   File "/Users/mwheeler/Library/Preferences/PyCharm2019.1/scratches/scratch.py", line 31, in
>     map_col_names_to_incides = {item.name: table.columns.index(item) for item in table.columns}
>   File "/Users/mwheeler/Library/Preferences/PyCharm2019.1/scratches/scratch.py", line 31, in
>     map_col_names_to_incides = {item.name: table.columns.index(item) for item in table.columns}
> AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'name'
> {code}
>
> This error occurs in both of the environments specified above.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958302#comment-16958302 ] Antoine Pitrou commented on ARROW-6977: --- Could you post the backtrace for all threads? Something like "thread apply all bt" should do. > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958299#comment-16958299 ] Neal Richardson commented on ARROW-6977: With lldb: {code} Assertion failed: (ec == 0), function unlock, file /BuildRoot/Library/Caches/com.apple.xbs/Sources/libcxx/libcxx-400.9.4/src/mutex.cpp, line 48. Process 36128 stopped * thread #10, stop reason = signal SIGABRT frame #0: 0x7fff729182c6 libsystem_kernel.dylib`__pthread_kill + 10 libsystem_kernel.dylib`__pthread_kill: -> 0x7fff729182c6 <+10>: jae0x7fff729182d0; <+20> 0x7fff729182c8 <+12>: movq %rax, %rdi 0x7fff729182cb <+15>: jmp0x7fff72912457; cerror_nocancel 0x7fff729182d0 <+20>: retq Target 0: (R) stopped. (lldb) bt * thread #10, stop reason = signal SIGABRT * frame #0: 0x7fff729182c6 libsystem_kernel.dylib`__pthread_kill + 10 frame #1: 0x7fff729cdbf1 libsystem_pthread.dylib`pthread_kill + 284 frame #2: 0x7fff728826a6 libsystem_c.dylib`abort + 127 frame #3: 0x7fff7284b20d libsystem_c.dylib`__assert_rtn + 324 frame #4: 0x7fff6f7e79a4 libc++.1.dylib`std::__1::mutex::unlock() + 46 frame #5: 0x00010a8d991a libarrow.100.dylib`std::__1::unique_lock::~unique_lock(this=0x7935db70) at __mutex_base:153:19 frame #6: 0x00010a8d9715 libarrow.100.dylib`std::__1::unique_lock::~unique_lock(this=0x7935db70) at __mutex_base:151:5 frame #7: 0x00010a8d96c1 libarrow.100.dylib`arrow::internal::ThreadedTaskGroup::OneTaskDone(this=0x000102373b00) at task_group.cc:152:5 frame #8: 0x00010a8dbe5f libarrow.100.dylib`arrow::internal::ThreadedTaskGroup::AppendReal(this=0x00010232cfe0)>)::'lambda'()::operator()() const at task_group.cc:97:9 frame #9: 0x00010a8dbd9d libarrow.100.dylib`decltype(__f=0x00010232cfe0)>)::'lambda'()&>(fp)()) std::__1::__invoke)::'lambda'()&>(arrow::internal::ThreadedTaskGroup::AppendReal(std::__1::function)::'lambda'()&) at type_traits:4361:1 frame #10: 0x00010a8dbd4d libarrow.100.dylib`void 
std::__1::__invoke_void_return_wrapper::__call)::'lambda'()&>(arrow::internal::ThreadedTaskGroup::AppendReal(std::__1::function)::'lambda'()&) at __functional_base:349:9 frame #11: 0x00010a8dbd1d libarrow.100.dylib`std::__1::__function::__alloc_func)::'lambda'(), std::__1::allocator)::'lambda'()>, void ()>::operator(this=0x00010232cfe0)() at functional:1527:16 frame #12: 0x00010a8daa59 libarrow.100.dylib`std::__1::__function::__func)::'lambda'(), std::__1::allocator)::'lambda'()>, void ()>::operator(this=0x00010232cfd0)() at functional:1651:12 frame #13: 0x00010a8e5185 libarrow.100.dylib`std::__1::__function::__value_func::operator(this=0x7935ddf0)() const at functional:1799:16 frame #14: 0x00010a8e4d35 libarrow.100.dylib`std::__1::function::operator(this=0x7935ddf0)() const at functional:2347:12 frame #15: 0x00010a8e46fa libarrow.100.dylib`arrow::internal::WorkerLoop(state=std::__1::shared_ptr::element_type @ 0x00010092e548 strong=17 weak=1, it=std::__1::list >::iterator @ 0x7935dde8) at thread_pool.cc:88:9 frame #16: 0x00010a8e4451 libarrow.100.dylib`arrow::internal::ThreadPool::LaunchWorkersUnlocked(this=0x000100a1cde8)::$_1::operator()() const at thread_pool.cc:225:37 frame #17: 0x00010a8e43cd libarrow.100.dylib`decltype(__f=0x000100a1cde8)::$_1>(fp)()) std::__1::__invoke(arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_1&&) at type_traits:4361:1 frame #18: 0x00010a8e4335 libarrow.100.dylib`void std::__1::__thread_execute >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_1>(__t=size=2, (null)=__tuple_indices<> @ 0x7935deb8)::$_1>&, std::__1::__tuple_indices<>) at thread:342:5 frame #19: 0x00010a8e3b16 libarrow.100.dylib`void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_1> >(__vp=0x000100a1cde0) at thread:352:5 frame #20: 0x7fff729cb2eb libsystem_pthread.dylib`_pthread_body + 126 frame #21: 0x7fff729ce249 libsystem_pthread.dylib`_pthread_start + 66 frame #22: 0x7fff729ca40d 
libsystem_pthread.dylib`thread_start + 13 {code} > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0, 0.15.1 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > :
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958295#comment-16958295 ]

Antoine Pitrou commented on ARROW-6977:

You can try running the executable under gdb with "gdb --args". Alternatively, enable core dumps (perhaps with "ulimit -c unlimited") and then run gdb on the core dump, like this: "gdb executable_file core_file". Once inside gdb, use "run" to start the application and then "bt" to get a backtrace. If you are debugging a core dump, you only need "bt".

> [C++] Only enable jemalloc background_thread if feature is supported
>
> Key: ARROW-6977
> URL: https://issues.apache.org/jira/browse/ARROW-6977
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Environment: macOS 10.14, Homebrew
> Reporter: Neal Richardson
> Priority: Major
> Fix For: 1.0.0, 0.15.1
>
> Followup to ARROW-6910. When loading the R package after that patch merged, I get this new message:
> {code}
> $ R
> > library(arrow)
> <jemalloc>: option background_thread currently supports pthread only
> {code}
> https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 is where the message comes from. Tracing that further, {{have_background_thread}} comes from https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, which gets set in {{configure.ac}} here: https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157
> In sum, on my system, that flag doesn't get set, so {{have_background_thread}} is false, and when that is false and the {{background_thread}} option is true, I get that message printed. And I do not want to see that message.
> cc [~wesm]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958294#comment-16958294 ]

Neal Richardson commented on ARROW-6977:

Update: I wiped my build dir and rebuilt with {{-DARROW_JEMALLOC=OFF}}. I no longer get the warning message, but I'm still able to trigger this abort. So it seems there's something new in master that triggers this, but it may not be the jemalloc background thread.
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958290#comment-16958290 ]

Neal Richardson commented on ARROW-6977:

With some handholding, I'm sure I could.
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958288#comment-16958288 ]

Antoine Pitrou commented on ARROW-6977:

[~npr] can you produce a gdb backtrace for that error?
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958287#comment-16958287 ]

Antoine Pitrou commented on ARROW-6977:

The error message is misleading: it is really about another system call being missing. Unless we can find a reliable version check, disabling the background thread on macOS may be the safest course of action. [~uwe]
[jira] [Created] (ARROW-6982) [R] Add bindings for compare and boolean kernels
Neal Richardson created ARROW-6982:

Summary: [R] Add bindings for compare and boolean kernels
Key: ARROW-6982
URL: https://issues.apache.org/jira/browse/ARROW-6982
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Neal Richardson
Assignee: Romain Francois
Fix For: 1.0.0

See cpp/src/arrow/compute/kernels/compare.h and boolean.h. ARROW-6980 introduces an Expression class that works on Arrow Arrays, but to evaluate the expressions, it has to pull the data into R first. These bindings would enable us to do the work in C++ and only pull in the result.
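As a rough illustration of what compare and boolean kernels compute, here is a minimal pure-Python sketch. It is not the Arrow C++ API: the function names (greater, less, and_) are hypothetical, null is modelled as None, and the null-in/null-out behaviour is an assumption made for this sketch rather than a statement about the real kernels.

```python
from typing import List, Optional

# Hypothetical stand-ins for element-wise kernels; None models a null slot.
# Assumption for this sketch: any null input yields a null output.

def greater(values: List[Optional[int]], threshold: int) -> List[Optional[bool]]:
    """Element-wise values > threshold."""
    return [None if v is None else v > threshold for v in values]

def less(values: List[Optional[int]], threshold: int) -> List[Optional[bool]]:
    """Element-wise values < threshold."""
    return [None if v is None else v < threshold for v in values]

def and_(left: List[Optional[bool]], right: List[Optional[bool]]) -> List[Optional[bool]]:
    """Element-wise logical AND of two boolean columns of equal length."""
    return [None if a is None or b is None else (a and b)
            for a, b in zip(left, right)]

column = [1, 4, None, 7]
# Evaluate (column > 2) & (column < 6) without leaving the "kernel" layer.
mask = and_(greater(column, 2), less(column, 6))
print(mask)  # [False, True, None, False]
```

The point of the issue is that an expression like this could be evaluated entirely on the C++ side, so only the final mask (or the filtered result) would need to be converted to R.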
[jira] [Updated] (ARROW-6960) [R] Add support for more compression codecs in Windows build
[ https://issues.apache.org/jira/browse/ARROW-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-6960:

Summary: [R] Add support for more compression codecs in Windows build (was: [R] Add information about zstd/lz4 codec installation and linkages for R users)

> [R] Add support for more compression codecs in Windows build
>
> Key: ARROW-6960
> URL: https://issues.apache.org/jira/browse/ARROW-6960
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Affects Versions: 0.15.0
> Environment: Windows 10
> Reporter: Grant Nguyen
> Priority: Minor
>
> When I attempt to write a parquet file using lz4, zstd, or brotli compression with R arrow 0.15.0, I am unable to do so because codec support was not built (example below).
>
> {code:java}
> > arrow::write_parquet(payout_strategy, sink = "records_test_lz4.parquet", compression = "lz4")
> Error in parquet___arrow___FileWriter__WriteTable(self, table, chunk_size) :
>   Arrow error: IOError: Arrow error: NotImplemented: LZ4 codec support not built
> {code}
>
> I believe that the error is generated through [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression.cc#L124-L145], but I am not sure how to call
> {code:java}
> install.packages("arrow")
> {code}
> in R to enable the ARROW_WITH_ZSTD/LZ4/BROTLI flags, or whether I should install zstd separately from arrow and then do something pre- or post-install to link zstd with arrow. From [https://github.com/apache/arrow/issues/1209], it appears that zstd support has been added to arrow and parquet in general, and the R package readme ([https://github.com/apache/arrow/tree/master/r]) notes "On macOS and Windows, installing a binary package from CRAN will handle Arrow's C++ dependencies for you", but I get the sense that does not apply to zstd.
>
> Is there guidance as to how to enable zstd and other compression codecs prior to or after downloading the R arrow package? Could this be added to the R documentation somewhere for future reference?
[jira] [Created] (ARROW-6981) [R] Implement HDFS file-system interface in R
Neal Richardson created ARROW-6981:

Summary: [R] Implement HDFS file-system interface in R
Key: ARROW-6981
URL: https://issues.apache.org/jira/browse/ARROW-6981
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Neal Richardson
[jira] [Commented] (ARROW-3750) [R] Pass various wrapped Arrow objects created in Python into R with zero copy via reticulate
[ https://issues.apache.org/jira/browse/ARROW-3750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958255#comment-16958255 ]

Neal Richardson commented on ARROW-3750:

https://github.com/pitrou/arrow/pull/5 is our proof-of-concept using the C API. Once the protocol is approved we can move ahead with it.

> [R] Pass various wrapped Arrow objects created in Python into R with zero copy via reticulate
>
> Key: ARROW-3750
> URL: https://issues.apache.org/jira/browse/ARROW-3750
> Project: Apache Arrow
> Issue Type: New Feature
> Components: R
> Reporter: Wes McKinney
> Assignee: Neal Richardson
> Priority: Major
> Fix For: 1.0.0
>
> A user may wish to use some functionality available only in pyarrow using reticulate; it would be useful to be able to construct an R wrapper object to the C++ object inside the corresponding Python type, e.g. {{pyarrow.Table}}.
> This probably will require some new functions to return the memory address of the shared_ptr/unique_ptr inside the Cython types so that a function on the R side can copy the smart pointer and create the corresponding R wrapper type.
> cc [~pitrou]
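The idea in the description, handing over a memory address so the other side can wrap the same data without copying it, can be sketched with Python's standard ctypes module. This is only an analogy under stated assumptions: it uses a plain int32 buffer rather than Arrow objects, both "sides" live in one Python process, and the variable names are made up for illustration.

```python
import ctypes

# "Producer" side: allocate a small int32 buffer and expose only its address.
source = (ctypes.c_int32 * 4)(10, 20, 30, 40)
address = ctypes.addressof(source)

# "Consumer" side: wrap the same memory starting from the bare address.
# from_address() does not copy; it creates a view over existing memory.
view = (ctypes.c_int32 * 4).from_address(address)

view[0] = 99       # write through the consumer's view
print(source[0])   # 99, because both names refer to the same bytes
```

Note that a bare address carries no ownership: here `source` must stay alive for as long as `view` is used. That is presumably why the description talks about copying the smart pointer on the R side, so that the wrapped object is kept alive by shared ownership.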
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958253#comment-16958253 ]

Wes McKinney commented on ARROW-6977:

Not having pthread seems a bit weird to me. I'm not sure what that is all about.
[jira] [Assigned] (ARROW-3750) [R] Pass various wrapped Arrow objects created in Python into R with zero copy via reticulate
[ https://issues.apache.org/jira/browse/ARROW-3750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson reassigned ARROW-3750:

Assignee: Neal Richardson
[jira] [Commented] (ARROW-3783) [R] Incorrect collection of float type
[ https://issues.apache.org/jira/browse/ARROW-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958252#comment-16958252 ]

Neal Richardson commented on ARROW-3783:

[~javierluraschi] is this still an issue? I don't have spark locally, but this works now:
{code}
> Array$create(1L, type=float32())
Array
[
  1
]
{code}
It looks like halffloat isn't supported, but that sounds like a different issue:
{code}
> Array$create(1L, type=float16())
Error in Array__from_vector(x, type) :
  NotImplemented: type not implemented
{code}

> [R] Incorrect collection of float type
>
> Key: ARROW-3783
> URL: https://issues.apache.org/jira/browse/ARROW-3783
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Javier Luraschi
> Priority: Major
> Labels: pull-request-available
> Time Spent: 2h
> Remaining Estimate: 0h
>
> Repro from `sparklyr`:
>
> {code:java}
> library(sparklyr)
> library(arrow)
> sc <- spark_connect(master = "local")
> DBI::dbGetQuery(sc, "SELECT cast(1 as float)")
> {code}
>
> Actual:
> {code:java}
>   CAST(1 AS FLOAT)
> 1 1065353216
> {code}
> Expected:
> {code:java}
>   CAST(1 AS FLOAT)
> 1 1
> {code}
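The "Actual" value in the report is consistent with the four bytes of a float32 being read back as a 32-bit integer: the IEEE 754 single-precision encoding of 1.0 is 0x3F800000, which is 1065353216 in decimal. The sketch below checks that arithmetic with Python's struct module. It is only an observation about the number, not a confirmed diagnosis of where the conversion goes wrong, and the helper names are made up for illustration.

```python
import struct

def float32_bits_as_int32(value: float) -> int:
    """Pack value as an IEEE 754 float32, then reinterpret the same 4 bytes as a signed int32."""
    return struct.unpack("<i", struct.pack("<f", value))[0]

def int32_bits_as_float32(bits: int) -> float:
    """Inverse direction: reinterpret the 4 bytes of a signed int32 as a float32."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]

misread = float32_bits_as_int32(1.0)
print(misread)                            # 1065353216
print(hex(misread))                       # 0x3f800000
print(int32_bits_as_float32(1065353216))  # 1.0
```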
[jira] [Updated] (ARROW-3783) [R] Incorrect collection of float type
[ https://issues.apache.org/jira/browse/ARROW-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-3783:

Issue Type: Bug (was: Improvement)
[jira] [Updated] (ARROW-6980) [R] dplyr backend for RecordBatch/Table
[ https://issues.apache.org/jira/browse/ARROW-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6980:

Labels: pull-request-available (was: )

> [R] dplyr backend for RecordBatch/Table
>
> Key: ARROW-6980
> URL: https://issues.apache.org/jira/browse/ARROW-6980
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Neal Richardson
> Assignee: Neal Richardson
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
[jira] [Created] (ARROW-6980) [R] dplyr backend for RecordBatch/Table
Neal Richardson created ARROW-6980:

Summary: [R] dplyr backend for RecordBatch/Table
Key: ARROW-6980
URL: https://issues.apache.org/jira/browse/ARROW-6980
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
Fix For: 1.0.0
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958219#comment-16958219 ]

Neal Richardson commented on ARROW-6977:

That said, I am experiencing this error occasionally today while running the R test suite locally. This is on master:
{code}
...
CsvTableReader: ..Assertion failed: (ec == 0), function unlock, file /BuildRoot/Library/Caches/com.apple.xbs/Sources/libcxx/libcxx-400.9.4/src/mutex.cpp, line 48.
/bin/sh: line 1: 59468 Abort trap: 6 R --slave -e 'library(testthat); setwd(file.path(.libPaths()[1], "arrow", "tests")); system.time(test_check("arrow", filter="", reporter=ifelse(nchar(""), "", "summary")))'
make: *** [test] Error 134
{code}
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958216#comment-16958216 ]

Antoine Pitrou commented on ARROW-6977:

Ah... can you run the tests fine? Also the C++ tests.
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958215#comment-16958215 ]

Neal Richardson commented on ARROW-6977:

It's just a message; it doesn't appear to cause an error, at least not immediately.
[jira] [Comment Edited] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958214#comment-16958214 ]

Antoine Pitrou edited comment on ARROW-6977 at 10/23/19 8:14 PM:

I wonder why this didn't come up on CI. Is macOS newer on Travis? [~kszucs]

was (Author: pitrou):
I wonder why this didn't come up on CI. Is macOS newer on Travis? [~kszucs]
[jira] [Updated] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-6977:

Fix Version/s: 0.15.1
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958214#comment-16958214 ]

Antoine Pitrou commented on ARROW-6977:

I wonder why this didn't come up on CI. Is macOS newer on Travis? [~kszucs]
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958212#comment-16958212 ] Neal Richardson commented on ARROW-6977: I'm not sure, but it sounds like the kind of thing we'll hear bug reports about if we don't. For the CRAN R packages, it's not an issue because the macOS and Windows binaries are built with jemalloc disabled: * https://github.com/apache/arrow/blob/59a6788c76330cf055bdbcbc7bdae7b0106c6656/dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb#L47 * https://github.com/apache/arrow/blob/59a6788c76330cf055bdbcbc7bdae7b0106c6656/ci/PKGBUILD#L85 > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
[ https://issues.apache.org/jira/browse/ARROW-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958205#comment-16958205 ] Antoine Pitrou commented on ARROW-6977: --- This sounds critical to get in for 0.15.1, right? > [C++] Only enable jemalloc background_thread if feature is supported > > > Key: ARROW-6977 > URL: https://issues.apache.org/jira/browse/ARROW-6977 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: macOS 10.14, Homebrew >Reporter: Neal Richardson >Priority: Major > Fix For: 1.0.0 > > > Followup to ARROW-6910. When loading the R package after that patch merged, I > get this new message: > {code} > $ R > > library(arrow) > : option background_thread currently supports pthread only > {code} > https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 > is where the message comes from. Tracing that further, > {{have_background_thread}} comes from > https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, > which gets set in {{configure.ac}} here: > https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 > In sum, on my system, that flag doesn't get set, so > {{have_background_thread}} is false, and when that is false and the > {{background_thread}} option is true, I get that message printed. And I do > not want to see that message. > cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6979) [R] Enable jemalloc in autobrew formula
Neal Richardson created ARROW-6979: -- Summary: [R] Enable jemalloc in autobrew formula Key: ARROW-6979 URL: https://issues.apache.org/jira/browse/ARROW-6979 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Fix For: 1.0.0 See https://github.com/apache/arrow/blob/59a6788c76330cf055bdbcbc7bdae7b0106c6656/dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb#L47 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6964) [C++][Dataset] Expose a nested parallel option for Scanner::ToTable
[ https://issues.apache.org/jira/browse/ARROW-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6964: -- Labels: dataset pull-request-available (was: dataset) > [C++][Dataset] Expose a nested parallel option for Scanner::ToTable > --- > > Key: ARROW-6964 > URL: https://issues.apache.org/jira/browse/ARROW-6964 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6978) [R] Add bindings for sum and mean compute kernels
Neal Richardson created ARROW-6978: -- Summary: [R] Add bindings for sum and mean compute kernels Key: ARROW-6978 URL: https://issues.apache.org/jira/browse/ARROW-6978 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Romain Francois Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6977) [C++] Only enable jemalloc background_thread if feature is supported
Neal Richardson created ARROW-6977: -- Summary: [C++] Only enable jemalloc background_thread if feature is supported Key: ARROW-6977 URL: https://issues.apache.org/jira/browse/ARROW-6977 Project: Apache Arrow Issue Type: Bug Components: C++ Environment: macOS 10.14, Homebrew Reporter: Neal Richardson Fix For: 1.0.0 Followup to ARROW-6910. When loading the R package after that patch merged, I get this new message: {code} $ R > library(arrow) : option background_thread currently supports pthread only {code} https://github.com/jemalloc/jemalloc/blob/3d84bd57f4954a17059bd31330ec87d3c1876411/src/background_thread.c#L884-L887 is where the message comes from. Tracing that further, {{have_background_thread}} comes from https://github.com/jemalloc/jemalloc/blob/21cfe59ff7b10a61dabe26cd3dbfb7a255e1f5e8/include/jemalloc/internal/jemalloc_preamble.h.in#L205-L211, which gets set in {{configure.ac}} here: https://github.com/jemalloc/jemalloc/blob/d2dddfb82aac9f2212922eb90324e84790704bfe/configure.ac#L2155-L2157 In sum, on my system, that flag doesn't get set, so {{have_background_thread}} is false, and when that is false and the {{background_thread}} option is true, I get that message printed. And I do not want to see that message. cc [~wesm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6960) [R] Add information about zstd/lz4 codec installation and linkages for R users
[ https://issues.apache.org/jira/browse/ARROW-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958064#comment-16958064 ] Neal Richardson commented on ARROW-6960: Sounds good. After you work out the lz4, if you wanted to move on to zstd, you could start by copying https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-zstd/PKGBUILD to rtools-packages (fork it and make a PR adding it). Appveyor will test it for you, and Jeroen can help you with the details. https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-brotli/PKGBUILD exists too but looks a little more involved because you'd probably want to prune the python-specific build targets. > [R] Add information about zstd/lz4 codec installation and linkages for R users > -- > > Key: ARROW-6960 > URL: https://issues.apache.org/jira/browse/ARROW-6960 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 0.15.0 > Environment: Windows 10 >Reporter: Grant Nguyen >Priority: Minor > > When I attempt to write a parquet file using lz4, zstd, or brotli compression > using R arrow 0.15.0, I am unable to do so due to the codec support not being > built (example below). > > {code:java} > > arrow::write_parquet(payout_strategy, sink = > > "records_test_lz4.parquet",compression = "lz4") > Error in parquet___arrow___FileWriter__WriteTable(self, table, chunk_size) : > Arrow error: IOError: Arrow error: NotImplemented: LZ4 codec support not > built{code} > > I believe that the error is generated through > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression.cc#L124-L145], > but I am not sure how to call > {code:java} > install.packages("arrow"){code} > in R to enable the ARROW_WITH_ZSTD/LZ4/BROTLI flags, or whether I should be > doing installing zstd separately from arrow and then doing something pre- or > post-install to link zstd with arrow. 
From > [https://github.com/apache/arrow/issues/1209], it appears that zstd support > has been added to arrow and parquet in general, and the R package readme > ([https://github.com/apache/arrow/tree/master/r)|https://github.com/apache/arrow/tree/master/r] > notes "On macOS and Windows, installing a binary package from CRAN will > handle Arrow's C++ dependencies for you", but I get the sense that does not > apply to zstd. > > Is there guidance as to how to enable zstd and other compression codecs prior > to or after downloading the R arrow package? Could this be added to the R > documentation somewhere for future reference? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6960) [R] Add information about zstd/lz4 codec installation and linkages for R users
[ https://issues.apache.org/jira/browse/ARROW-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958058#comment-16958058 ] Grant Nguyen edited comment on ARROW-6960 at 10/23/19 5:33 PM: --- Thanks [~npr] for the very detailed explanation, that helps a lot! I will look into this in the next few days – the lz4 addition to PKGBUILD seems like a good starting point – not sure that I have quite the level of expertise to add zstd and brotli to rtools but will investigate further. was (Author: gngu): Thanks [~npr] for the very detailed explanation, that helps a lot! I will look into this in the next few days – the lz4 seems like a good starting point – not sure that I have quite the level of expertise to add zstd and brotli to rtools but will investigate further. > [R] Add information about zstd/lz4 codec installation and linkages for R users > -- > > Key: ARROW-6960 > URL: https://issues.apache.org/jira/browse/ARROW-6960 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 0.15.0 > Environment: Windows 10 >Reporter: Grant Nguyen >Priority: Minor > > When I attempt to write a parquet file using lz4, zstd, or brotli compression > using R arrow 0.15.0, I am unable to do so due to the codec support not being > built (example below). 
> > {code:java} > > arrow::write_parquet(payout_strategy, sink = > > "records_test_lz4.parquet",compression = "lz4") > Error in parquet___arrow___FileWriter__WriteTable(self, table, chunk_size) : > Arrow error: IOError: Arrow error: NotImplemented: LZ4 codec support not > built{code} > > I believe that the error is generated through > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression.cc#L124-L145], > but I am not sure how to call > {code:java} > install.packages("arrow"){code} > in R to enable the ARROW_WITH_ZSTD/LZ4/BROTLI flags, or whether I should be > doing installing zstd separately from arrow and then doing something pre- or > post-install to link zstd with arrow. From > [https://github.com/apache/arrow/issues/1209], it appears that zstd support > has been added to arrow and parquet in general, and the R package readme > ([https://github.com/apache/arrow/tree/master/r)|https://github.com/apache/arrow/tree/master/r] > notes "On macOS and Windows, installing a binary package from CRAN will > handle Arrow's C++ dependencies for you", but I get the sense that does not > apply to zstd. > > Is there guidance as to how to enable zstd and other compression codecs prior > to or after downloading the R arrow package? Could this be added to the R > documentation somewhere for future reference? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6960) [R] Add information about zstd/lz4 codec installation and linkages for R users
[ https://issues.apache.org/jira/browse/ARROW-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958058#comment-16958058 ] Grant Nguyen commented on ARROW-6960: - Thanks [~npr] for the very detailed explanation, that helps a lot! I will look into this in the next few days – the lz4 seems like a good starting point – not sure that I have quite the level of expertise to add zstd and brotli to rtools but will investigate further. > [R] Add information about zstd/lz4 codec installation and linkages for R users > -- > > Key: ARROW-6960 > URL: https://issues.apache.org/jira/browse/ARROW-6960 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 0.15.0 > Environment: Windows 10 >Reporter: Grant Nguyen >Priority: Minor > > When I attempt to write a parquet file using lz4, zstd, or brotli compression > using R arrow 0.15.0, I am unable to do so due to the codec support not being > built (example below). > > {code:java} > > arrow::write_parquet(payout_strategy, sink = > > "records_test_lz4.parquet",compression = "lz4") > Error in parquet___arrow___FileWriter__WriteTable(self, table, chunk_size) : > Arrow error: IOError: Arrow error: NotImplemented: LZ4 codec support not > built{code} > > I believe that the error is generated through > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression.cc#L124-L145], > but I am not sure how to call > {code:java} > install.packages("arrow"){code} > in R to enable the ARROW_WITH_ZSTD/LZ4/BROTLI flags, or whether I should be > doing installing zstd separately from arrow and then doing something pre- or > post-install to link zstd with arrow. 
From > [https://github.com/apache/arrow/issues/1209], it appears that zstd support > has been added to arrow and parquet in general, and the R package readme > ([https://github.com/apache/arrow/tree/master/r)|https://github.com/apache/arrow/tree/master/r] > notes "On macOS and Windows, installing a binary package from CRAN will > handle Arrow's C++ dependencies for you", but I get the sense that does not > apply to zstd. > > Is there guidance as to how to enable zstd and other compression codecs prior > to or after downloading the R arrow package? Could this be added to the R > documentation somewhere for future reference? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-6971) [Rust] Replace "RecordBatchReader" with "BatchIterator"
[ https://issues.apache.org/jira/browse/ARROW-6971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan closed ARROW-6971. -- Fix Version/s: (was: 1.0.0) Resolution: Not A Bug > [Rust] Replace "RecordBatchReader" with "BatchIterator" > --- > > Key: ARROW-6971 > URL: https://issues.apache.org/jira/browse/ARROW-6971 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Affects Versions: 0.15.0 >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Minor > > As part of the recent reader work we introduced > {code:java} > // arrow::record_batch::RecordBatchReader{code} > but in datafusion we have > {code:java} > // datafusion::physical_plan::BatchIterator > {code} > These two traits are almost identical (BatchIterator implements Send + Sync > whereas RecordBatchReader does not). I propose we replace RecordBatchReader > with BatchIterator (i.e. move it to arrow as it's generally useful outside of > datafusion) and update parquet and datafusion accordingly. > [~andygrove] [~liurenjie1024] do you see any issues with this? > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6971) [Rust] Replace "RecordBatchReader" with "BatchIterator"
[ https://issues.apache.org/jira/browse/ARROW-6971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958012#comment-16958012 ] Paddy Horan commented on ARROW-6971: Ahh, ok then. > [Rust] Replace "RecordBatchReader" with "BatchIterator" > --- > > Key: ARROW-6971 > URL: https://issues.apache.org/jira/browse/ARROW-6971 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Affects Versions: 0.15.0 >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Minor > Fix For: 1.0.0 > > > As part of the recent reader work we introduced > {code:java} > // arrow::record_batch::RecordBatchReader{code} > but in datafusion we have > {code:java} > // datafusion::physical_plan::BatchIterator > {code} > These two traits are almost identical (BatchIterator implements Send + Sync > whereas RecordBatchReader does not). I propose we replace RecordBatchReader > with BatchIterator (i.e. move it to arrow as it's generally useful outside of > datafusion) and update parquet and datafusion accordingly. > [~andygrove] [~liurenjie1024] do you see any issues with this? > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6969) [C++][Dataset] ParquetScanTask eagerly load file
[ https://issues.apache.org/jira/browse/ARROW-6969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-6969: - Assignee: Francois Saint-Jacques > [C++][Dataset] ParquetScanTask eagerly load file > - > > Key: ARROW-6969 > URL: https://issues.apache.org/jira/browse/ARROW-6969 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset > > The file content should only be read when invoking ParquetScanTask::Scan, not > on construction. This blocks reading in a true streaming fashion with memory > constraints. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6964) [C++][Dataset] Expose a nested parallel option for Scanner::ToTable
[ https://issues.apache.org/jira/browse/ARROW-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-6964: - Assignee: Francois Saint-Jacques > [C++][Dataset] Expose a nested parallel option for Scanner::ToTable > --- > > Key: ARROW-6964 > URL: https://issues.apache.org/jira/browse/ARROW-6964 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6950) [C++][Dataset] Add example/benchmark for reading parquet files with dataset
[ https://issues.apache.org/jira/browse/ARROW-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-6950: - Assignee: Francois Saint-Jacques > [C++][Dataset] Add example/benchmark for reading parquet files with dataset > --- > > Key: ARROW-6950 > URL: https://issues.apache.org/jira/browse/ARROW-6950 > Project: Apache Arrow > Issue Type: Test > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Create an executable that loads a directory with a known partition scheme, > applying a filter and a projection. This will be used as a baseline for future > performance improvements and also to show various features of the dataset API. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6976) Possible memory leak in pyarrow read_parquet
david cottrell created ARROW-6976: - Summary: Possible memory leak in pyarrow read_parquet Key: ARROW-6976 URL: https://issues.apache.org/jira/browse/ARROW-6976 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.0 Environment: linux ubuntu 18.04 Reporter: david cottrell Attachments: image-2019-10-23-16-17-20-739.png Version and repro info in the gist below. Not sure if I'm misunderstanding something from this [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/] but there seems to be memory accumulation that is exacerbated by higher-arity objects like strings and dates (not datetimes). I was not able to reproduce the issue on macOS. Downgrading to 0.14.1 seemed to "fix" or lessen the problem. [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62] Let me know if this post should go elsewhere. !image-2019-10-23-16-17-20-739.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6950) [C++][Dataset] Add example/benchmark for reading parquet files with dataset
[ https://issues.apache.org/jira/browse/ARROW-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6950: -- Labels: dataset pull-request-available (was: dataset) > [C++][Dataset] Add example/benchmark for reading parquet files with dataset > --- > > Key: ARROW-6950 > URL: https://issues.apache.org/jira/browse/ARROW-6950 > Project: Apache Arrow > Issue Type: Test > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > > Create an executable that loads a directory with a known partition scheme, > applying a filter and a projection. This will be used as a baseline for future > performance improvements and also to show various features of the dataset API. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6749) [Python] Conversion of non-ns timestamp array to numpy gives wrong values
[ https://issues.apache.org/jira/browse/ARROW-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6749: -- Labels: pull-request-available (was: ) > [Python] Conversion of non-ns timestamp array to numpy gives wrong values > - > > Key: ARROW-6749 > URL: https://issues.apache.org/jira/browse/ARROW-6749 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > > {code} > In [25]: np_arr = np.arange("2012-01-01", "2012-01-06", int(1e6)*60*60*24, > dtype="datetime64[us]") > > In [26]: np_arr > > > Out[26]: > array(['2012-01-01T00:00:00.00', '2012-01-02T00:00:00.00', >'2012-01-03T00:00:00.00', '2012-01-04T00:00:00.00', >'2012-01-05T00:00:00.00'], dtype='datetime64[us]') > In [27]: arr = pa.array(np_arr) > > > In [28]: arr > > > Out[28]: > > [ > 2012-01-01 00:00:00.00, > 2012-01-02 00:00:00.00, > 2012-01-03 00:00:00.00, > 2012-01-04 00:00:00.00, > 2012-01-05 00:00:00.00 > ] > In [29]: arr.type > > > Out[29]: TimestampType(timestamp[us]) > In [30]: arr.to_numpy() > > > Out[30]: > array(['1970-01-16T08:09:36.0', '1970-01-16T08:11:02.4', >'1970-01-16T08:12:28.8', '1970-01-16T08:13:55.2', >'1970-01-16T08:15:21.6'], dtype='datetime64[ns]') > {code} > So it seems to simply interpret the integer microsecond values as nanoseconds > when converting to numpy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
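The values shown in Out[30] of the ARROW-6749 report are consistent with its diagnosis that the stored microsecond integers are read as nanoseconds. The following is a minimal sketch of that arithmetic using only the Python standard library; it does not call pyarrow or numpy, and the helper names `micros_since_epoch` and `misread_as_nanos` are hypothetical, introduced only for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_since_epoch(dt):
    """Return the integer number of microseconds between the epoch and dt."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def misread_as_nanos(dt):
    """Reinterpret dt's microsecond count as if it were a nanosecond count."""
    raw = micros_since_epoch(dt)   # stored integer, true unit: microseconds
    as_micros = raw // 1000        # the same integer treated as nanoseconds
    return EPOCH + timedelta(microseconds=as_micros)


first = datetime(2012, 1, 1, tzinfo=timezone.utc)
second = datetime(2012, 1, 2, tzinfo=timezone.utc)
print(misread_as_nanos(first))
print(misread_as_nanos(second))
```

The two results fall on 1970-01-16 at 08:09:36 and 08:11:02.4, matching the first two entries of the reported `to_numpy()` output. Consecutive days end up 86.4 seconds apart, which is one day's worth of microseconds counted as nanoseconds.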
[jira] [Resolved] (ARROW-6503) [C++] Add an argument of memory pool object to SparseTensorConverter
[ https://issues.apache.org/jira/browse/ARROW-6503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6503. --- Resolution: Fixed Issue resolved by pull request 5707 [https://github.com/apache/arrow/pull/5707] > [C++] Add an argument of memory pool object to SparseTensorConverter > > > Key: ARROW-6503 > URL: https://issues.apache.org/jira/browse/ARROW-6503 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > According to the comment > https://github.com/apache/arrow/pull/5290#discussion_r322244745, we need to > have variants of some functions for supplying a memory pool object to > SparseTensorConverter function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6973) [C++][ThreadPool] Use perfect forwarding in Submit
[ https://issues.apache.org/jira/browse/ARROW-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6973. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5711 [https://github.com/apache/arrow/pull/5711] > [C++][ThreadPool] Use perfect forwarding in Submit > -- > > Key: ARROW-6973 > URL: https://issues.apache.org/jira/browse/ARROW-6973 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Artem Alekseev >Assignee: Artem Alekseev >Priority: Trivial > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6975) [C++] Put make_unique in its own header
Antoine Pitrou created ARROW-6975: - Summary: [C++] Put make_unique in its own header Key: ARROW-6975 URL: https://issues.apache.org/jira/browse/ARROW-6975 Project: Apache Arrow Issue Type: Wish Components: C++ Reporter: Antoine Pitrou {{arrow/util/stl.h}} carries other stuff that is almost never necessary. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-3543) [R] Better support for timestamp format and time zones in R
[ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957889#comment-16957889 ] Wes McKinney commented on ARROW-3543: - There will be an update on JIRA when there is activity. > [R] Better support for timestamp format and time zones in R > --- > > Key: ARROW-3543 > URL: https://issues.apache.org/jira/browse/ARROW-3543 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Olaf >Priority: Major > Fix For: 1.0.0 > > > See below for original description and reports. In sum, there is a mismatch > between how the C++ library and R interpret data without a timezone, and it > turns out that we're not passing the timezone to R if it is set in Arrow C++ > anyway. > The [C++ library > docs|http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE] > say "If a timezone-aware field contains a recognized timezone, its values > may be localized to that locale upon display; the values of timezone-naive > fields must always be displayed “as is”, with no localization performed on > them." But R's print default, as well as the parsing default, is the current > time zone: > https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html > The C++ library seems to parse timestamp strings that don't have timezone > information as if they are UTC, so when you read timezone-naive timestamps > from Arrow and print them in R, they are shifted to be localized to the > current timezone. If you print timestamp data from Arrow with > {{print(timestamp_var, tz="GMT")}} it would look as you expect. > On further inspection, the [arrow-to-vector code for > timestamp|https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L504-L514] > doesn't seem to consider time zone information even if it does exist. So we > don't have the means currently in R to display timestamp data faithfully, > whether or not it is timezone-aware. 
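The shift described in the ARROW-3543 summary above can be sketched with the Python standard library alone. This is an illustration rather than Arrow or R code: it assumes the viewer's session is in US Eastern time, modelled as a fixed UTC-5 offset (no daylight saving in February), and the helper name `display_in_local_zone` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for the reader's local zone.
EASTERN = timezone(timedelta(hours=-5), "EST")


def display_in_local_zone(naive, local_zone):
    """Treat a timezone-naive value as UTC, then localize it for display."""
    as_utc = naive.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(local_zone).replace(tzinfo=None)


stored = datetime(2018, 2, 1, 14, 0, 0, 531000)  # written without a timezone
shown = display_in_local_zone(stored, EASTERN)
unchanged = display_in_local_zone(stored, timezone.utc)
print(stored, "->", shown)
```

Under these assumptions the stored wall-clock time 14:00:00.531 is displayed as 09:00:00.531, the same five-hour shift as in the original report. Passing `timezone.utc` as the display zone leaves the value unchanged, which mirrors the `print(timestamp_var, tz="GMT")` workaround mentioned in the summary.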
> Among the tasks here: > * Include the timezone attribute in the POSIXct R vector that gets created > from a timestamp Arrow array > * Ensure that timezone-naive data from Arrow is printed in R "as is" with no > localization > - > Original description: > Hello the dream team, > Pasting from [https://github.com/wesm/feather/issues/351] > Thanks for this wonderful package. I was playing with feather and some > timestamps and I noticed some dangerous behavior. Maybe it is a bug. > Consider this > > {code:java} > import pandas as pd > import feather > import numpy as np > df = pd.DataFrame( > {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), > pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 > 14:01:02.200')]} > ) > df['timestamp_est'] = > pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None) > df > Out[17]: > string_time_utc timestamp_est > 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531 > 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 > 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 > {code} > Here I create the corresponding `EST` timestamp of my original timestamps (in > `UTC` time). > Now saving the dataframe to `csv` or to `feather` will generate two > completely different results. > > {code:java} > df.to_csv('P://testing.csv') > df.to_feather('P://testing.feather') > {code} > Switching to R. > Using the good old `csv` gives me something a bit annoying, but expected. R > thinks my timezone is `UTC` by default, and wrongly attached this timezone to > `timestamp_est`. No big deal, I can always use `with_tz` or even better: > import as character and process as timestamp while in R. 
> > {code:java} > > dataframe <- read_csv('P://testing.csv') > Parsed with column specification: > cols( > X1 = col_integer(), > string_time_utc = col_datetime(format = ""), > timestamp_est = col_datetime(format = "") > ) > Warning message: > Missing column names filled in: 'X1' [1] > > > > dataframe %>% mutate(mytimezone = tz(timestamp_est)) > A tibble: 3 x 4 > X1 string_time_utc timestamp_est > > 1 0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 > 2 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 > 3 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 > mytimezone > > 1 UTC > 2 UTC > 3 UTC {code} > {code:java} > #Now look at what happens with feather: > > > dataframe <- read_feather('P://testing.feather') > > > > dataframe %>% mutate(mytimezone = tz(timestamp_est)) > A tibble: 3 x 3 > string_time_utc timestamp_est mytimezone > > 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 "" > 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 "" > 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 "" {code} > My timestamps have been
[jira] [Commented] (ARROW-3543) [R] Better support for timestamp format and time zones in R
[ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957884#comment-16957884 ]

Shannon C Lewis commented on ARROW-3543:

Just checking in: are there any updates on this?

> [R] Better support for timestamp format and time zones in R
> ---
>
> Key: ARROW-3543
> URL: https://issues.apache.org/jira/browse/ARROW-3543
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Olaf
> Priority: Major
> Fix For: 1.0.0
>
> See below for original description and reports. In sum, there is a mismatch
> between how the C++ library and R interpret data without a timezone, and it
> turns out that we're not passing the timezone to R if it is set in Arrow C++
> anyway.
>
> The [C++ library docs|http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE]
> say "If a timezone-aware field contains a recognized timezone, its values
> may be localized to that locale upon display; the values of timezone-naive
> fields must always be displayed “as is”, with no localization performed on
> them." But R's print default, as well as the parsing default, is the current
> time zone:
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html
>
> The C++ library seems to parse timestamp strings that don't have timezone
> information as if they are UTC, so when you read timezone-naive timestamps
> from Arrow and print them in R, they are shifted to be localized to the
> current timezone. If you print timestamp data from Arrow with
> {{print(timestamp_var, tz="GMT")}} it would look as you expect.
>
> On further inspection, the [arrow-to-vector code for timestamp|https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L504-L514]
> doesn't seem to consider time zone information even if it does exist. So we
> don't have the means currently in R to display timestamp data faithfully,
> whether or not it is timezone-aware.
>
> Among the tasks here:
> * Include the timezone attribute in the POSIXct R vector that gets created
>   from a timestamp Arrow array
> * Ensure that timezone-naive data from Arrow is printed in R "as is" with no
>   localization
>
> -
> Original description:
>
> Hello the dream team,
>
> Pasting from [https://github.com/wesm/feather/issues/351]
>
> Thanks for this wonderful package. I was playing with feather and some
> timestamps and I noticed some dangerous behavior. Maybe it is a bug.
>
> Consider this
>
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> df = pd.DataFrame(
>     {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'),
>                           pd.to_datetime('2018-02-01 14:01:00.456'),
>                           pd.to_datetime('2018-03-05 14:01:02.200')]}
> )
> df['timestamp_est'] = pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
> df
> Out[17]:
>           string_time_utc           timestamp_est
> 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
> 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
> 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
>
> Here I create the corresponding `EST` timestamp of my original timestamps (in
> `UTC` time).
>
> Now saving the dataframe to `csv` or to `feather` will generate two
> completely different results.
>
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
>
> Switching to R.
>
> Using the good old `csv` gives me something a bit annoying, but expected. R
> thinks my timezone is `UTC` by default, and wrongly attached this timezone to
> `timestamp_est`. No big deal, I can always use `with_tz` or even better:
> import as character and process as timestamp while in R.
>
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
> Parsed with column specification:
> cols(
>   X1 = col_integer(),
>   string_time_utc = col_datetime(format = ""),
>   timestamp_est = col_datetime(format = "")
> )
> Warning message:
> Missing column names filled in: 'X1' [1]
> >
> > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 4
>      X1 string_time_utc         timestamp_est
>
> 1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530
> 2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
> 3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
>   mytimezone
>
> 1 UTC
> 2 UTC
> 3 UTC
> {code}
>
> {code:java}
> #Now look at what happens with feather:
> >
> > dataframe <- read_feather('P://testing.feather')
> >
> > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 3
>   string_time_utc         timestamp_est           mytimezone
>
> 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
> 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
> 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
> {code}
>
> My timestamps have been converted!!! pure
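The five-hour shift in the feather output above can be reproduced with plain arithmetic. The following is only a minimal sketch using the Python standard library, not the Arrow or R code path: it assumes the reading side treats the timezone-naive value as UTC and then localizes it to a local zone of UTC-5, which is what the reported output suggests. The names `EST`, `naive_est` and `shown` are illustrative.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: a fixed UTC-5 offset stands in for US/Eastern in February (no DST handling).
EST = timezone(timedelta(hours=-5), "EST")

# First timestamp from the report, as a UTC instant.
instant = datetime(2018, 2, 1, 14, 0, 0, 531000, tzinfo=UTC)

# What the pandas snippet does: convert to Eastern wall-clock time, then drop the zone.
naive_est = instant.astimezone(EST).replace(tzinfo=None)

# What the report observes on reading: the naive value is taken as UTC
# and localized again for display, so the offset is applied twice.
shown = naive_est.replace(tzinfo=UTC).astimezone(EST)

print("stored (naive):", naive_est)                 # wall-clock value written to the file
print("displayed     :", shown.replace(tzinfo=None))
print("'as is' display would be:", naive_est)
```

Displaying the naive value "as is", as the C++ docs quoted in the issue require, would leave the 09:00 wall-clock time untouched.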
[jira] [Created] (ARROW-6974) [C++] Implement Cast kernel for time-likes with ArrayDataVisitor pattern
Joris Van den Bossche created ARROW-6974:

Summary: [C++] Implement Cast kernel for time-likes with ArrayDataVisitor pattern
Key: ARROW-6974
URL: https://issues.apache.org/jira/browse/ARROW-6974
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche

Currently, the casting for time-like data is done with the {{ShiftTime}} function. It _might_ be possible to simplify this with ArrayDataVisitor (to avoid looping / checking the bitmap).

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6958) [Python] tutorial script for arrow in spark throws error
[ https://issues.apache.org/jira/browse/ARROW-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6958: - Summary: [Python] tutorial script for arrow in spark throws error (was: [python] tutorial script for arrow in spark throws error) > [Python] tutorial script for arrow in spark throws error > > > Key: ARROW-6958 > URL: https://issues.apache.org/jira/browse/ARROW-6958 > Project: Apache Arrow > Issue Type: Bug > Components: Java, Python >Affects Versions: 0.15.0 > Environment: Ubuntu v 18. Cluster spun up on google dataproc - see > startup for specs of cluster >Reporter: Karl Svensson >Priority: Major > Labels: newbie > Attachments: arrow_error.txt, start-cluster-nl.ps1.txt > > > Running the arrow example for pyspark ([found here > |[https://github.com/apache/spark/blob/master/examples/src/main/python/sql/arrow.py]]) > causes a java.lang.IllegalArgumentException error. Running the same script > with pyarrow v 0.8.0 causes the script to run correctly. > Attached are the startup settings in google dataproc I'm using to create the > cluster, as well as the output (with error text). It isn't immediately > obvious to me what is causing the issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-6958) [Python] tutorial script for arrow in spark throws error
[ https://issues.apache.org/jira/browse/ARROW-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-6958. Resolution: Not A Problem > [Python] tutorial script for arrow in spark throws error > > > Key: ARROW-6958 > URL: https://issues.apache.org/jira/browse/ARROW-6958 > Project: Apache Arrow > Issue Type: Bug > Components: Java, Python >Affects Versions: 0.15.0 > Environment: Ubuntu v 18. Cluster spun up on google dataproc - see > startup for specs of cluster >Reporter: Karl Svensson >Priority: Major > Labels: newbie > Attachments: arrow_error.txt, start-cluster-nl.ps1.txt > > > Running the arrow example for pyspark ([found here > |[https://github.com/apache/spark/blob/master/examples/src/main/python/sql/arrow.py]]) > causes a java.lang.IllegalArgumentException error. Running the same script > with pyarrow v 0.8.0 causes the script to run correctly. > Attached are the startup settings in google dataproc I'm using to create the > cluster, as well as the output (with error text). It isn't immediately > obvious to me what is causing the issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6958) [python] tutorial script for arrow in spark throws error
[ https://issues.apache.org/jira/browse/ARROW-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6958: - Fix Version/s: (was: 0.8.0) > [python] tutorial script for arrow in spark throws error > > > Key: ARROW-6958 > URL: https://issues.apache.org/jira/browse/ARROW-6958 > Project: Apache Arrow > Issue Type: Bug > Components: Java, Python >Affects Versions: 0.15.0 > Environment: Ubuntu v 18. Cluster spun up on google dataproc - see > startup for specs of cluster >Reporter: Karl Svensson >Priority: Major > Labels: newbie > Attachments: arrow_error.txt, start-cluster-nl.ps1.txt > > > Running the arrow example for pyspark ([found here > |[https://github.com/apache/spark/blob/master/examples/src/main/python/sql/arrow.py]]) > causes a java.lang.IllegalArgumentException error. Running the same script > with pyarrow v 0.8.0 causes the script to run correctly. > Attached are the startup settings in google dataproc I'm using to create the > cluster, as well as the output (with error text). It isn't immediately > obvious to me what is causing the issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-6958) [python] tutorial script for arrow in spark throws error
[ https://issues.apache.org/jira/browse/ARROW-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reopened ARROW-6958: -- > [python] tutorial script for arrow in spark throws error > > > Key: ARROW-6958 > URL: https://issues.apache.org/jira/browse/ARROW-6958 > Project: Apache Arrow > Issue Type: Bug > Components: Java, Python >Affects Versions: 0.15.0 > Environment: Ubuntu v 18. Cluster spun up on google dataproc - see > startup for specs of cluster >Reporter: Karl Svensson >Priority: Major > Labels: newbie > Fix For: 0.8.0 > > Attachments: arrow_error.txt, start-cluster-nl.ps1.txt > > > Running the arrow example for pyspark ([found here > |[https://github.com/apache/spark/blob/master/examples/src/main/python/sql/arrow.py]]) > causes a java.lang.IllegalArgumentException error. Running the same script > with pyarrow v 0.8.0 causes the script to run correctly. > Attached are the startup settings in google dataproc I'm using to create the > cluster, as well as the output (with error text). It isn't immediately > obvious to me what is causing the issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6958) [python] tutorial script for arrow in spark throws error
[ https://issues.apache.org/jira/browse/ARROW-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957749#comment-16957749 ] Joris Van den Bossche commented on ARROW-6958: -- The relevant spark issue is https://issues.apache.org/jira/browse/SPARK-29367 > [python] tutorial script for arrow in spark throws error > > > Key: ARROW-6958 > URL: https://issues.apache.org/jira/browse/ARROW-6958 > Project: Apache Arrow > Issue Type: Bug > Components: Java, Python >Affects Versions: 0.15.0 > Environment: Ubuntu v 18. Cluster spun up on google dataproc - see > startup for specs of cluster >Reporter: Karl Svensson >Priority: Major > Labels: newbie > Fix For: 0.8.0 > > Attachments: arrow_error.txt, start-cluster-nl.ps1.txt > > > Running the arrow example for pyspark ([found here > |[https://github.com/apache/spark/blob/master/examples/src/main/python/sql/arrow.py]]) > causes a java.lang.IllegalArgumentException error. Running the same script > with pyarrow v 0.8.0 causes the script to run correctly. > Attached are the startup settings in google dataproc I'm using to create the > cluster, as well as the output (with error text). It isn't immediately > obvious to me what is causing the issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-6958) [python] tutorial script for arrow in spark throws error
[ https://issues.apache.org/jira/browse/ARROW-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-6958. Resolution: Not A Problem > [python] tutorial script for arrow in spark throws error > > > Key: ARROW-6958 > URL: https://issues.apache.org/jira/browse/ARROW-6958 > Project: Apache Arrow > Issue Type: Bug > Components: Java, Python >Affects Versions: 0.15.0 > Environment: Ubuntu v 18. Cluster spun up on google dataproc - see > startup for specs of cluster >Reporter: Karl Svensson >Priority: Major > Labels: newbie > Fix For: 0.8.0 > > Attachments: arrow_error.txt, start-cluster-nl.ps1.txt > > > Running the arrow example for pyspark ([found here > |[https://github.com/apache/spark/blob/master/examples/src/main/python/sql/arrow.py]]) > causes a java.lang.IllegalArgumentException error. Running the same script > with pyarrow v 0.8.0 causes the script to run correctly. > Attached are the startup settings in google dataproc I'm using to create the > cluster, as well as the output (with error text). It isn't immediately > obvious to me what is causing the issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6958) [python] tutorial script for arrow in spark throws error
[ https://issues.apache.org/jira/browse/ARROW-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957745#comment-16957745 ]

Joris Van den Bossche edited comment on ARROW-6958 at 10/23/19 10:43 AM:

pyspark is not yet compatible with the latest pyarrow 0.15.0 version; see e.g. https://stackoverflow.com/questions/58273063/pandasudf-and-pyarrow-0-15-0 for an explanation and how to solve it. I suppose you are encountering the same issue.

was (Author: jorisvandenbossche):
pyspark is not yet compatible with the latest pyarrow 0.15.0 version; see e.g. https://stackoverflow.com/questions/58273063/pandasudf-and-pyarrow-0-15-0 for an explanation

> [python] tutorial script for arrow in spark throws error
>
> Key: ARROW-6958
> URL: https://issues.apache.org/jira/browse/ARROW-6958
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java, Python
> Affects Versions: 0.15.0
> Environment: Ubuntu v 18. Cluster spun up on google dataproc - see
> startup for specs of cluster
> Reporter: Karl Svensson
> Priority: Major
> Labels: newbie
> Fix For: 0.8.0
>
> Attachments: arrow_error.txt, start-cluster-nl.ps1.txt
>
> Running the arrow example for pyspark
> ([found here|https://github.com/apache/spark/blob/master/examples/src/main/python/sql/arrow.py])
> causes a java.lang.IllegalArgumentException error. Running the same script
> with pyarrow v 0.8.0 causes the script to run correctly.
>
> Attached are the startup settings in google dataproc I'm using to create the
> cluster, as well as the output (with error text). It isn't immediately
> obvious to me what is causing the issue.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6958) [python] tutorial script for arrow in spark throws error
[ https://issues.apache.org/jira/browse/ARROW-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957745#comment-16957745 ] Joris Van den Bossche commented on ARROW-6958: -- pyspark is not yet compatible with the latest pyarrow 0.15.0 version, see eg https://stackoverflow.com/questions/58273063/pandasudf-and-pyarrow-0-15-0 for an explanation > [python] tutorial script for arrow in spark throws error > > > Key: ARROW-6958 > URL: https://issues.apache.org/jira/browse/ARROW-6958 > Project: Apache Arrow > Issue Type: Bug > Components: Java, Python >Affects Versions: 0.15.0 > Environment: Ubuntu v 18. Cluster spun up on google dataproc - see > startup for specs of cluster >Reporter: Karl Svensson >Priority: Major > Labels: newbie > Fix For: 0.8.0 > > Attachments: arrow_error.txt, start-cluster-nl.ps1.txt > > > Running the arrow example for pyspark ([found here > |[https://github.com/apache/spark/blob/master/examples/src/main/python/sql/arrow.py]]) > causes a java.lang.IllegalArgumentException error. Running the same script > with pyarrow v 0.8.0 causes the script to run correctly. > Attached are the startup settings in google dataproc I'm using to create the > cluster, as well as the output (with error text). It isn't immediately > obvious to me what is causing the issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6968) [Python] 0.14.1 to 0.15.0 upgrade produces AttributeError
[ https://issues.apache.org/jira/browse/ARROW-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957733#comment-16957733 ]

Joris Van den Bossche commented on ARROW-6968:

Hi [~mwheeler-hdai], this was a backwards-incompatible change in pyarrow 0.15.0. The {{Column}} class (a small wrapper around ChunkedArray) was removed, and a column of a Table is now returned as a {{ChunkedArray}}. In most cases a {{ChunkedArray}} behaves like the removed {{Column}} and offers similar functionality, but one of the differences is that {{ChunkedArray}} has no 'name' attribute.

You could replace the
{code}
map_col_names_to_incides = {item.name: table.columns.index(item) for item in table.columns}
{code}
with e.g.
{code}
map_col_names_to_incides = {name: i for i, name in enumerate(table.column_names)}
{code}
as the column_names are guaranteed to be in the correct order (or another option: {{dict(zip(table.column_names, range(table.num_columns)))}}).

> [Python] 0.14.1 to 0.15.0 upgrade produces AttributeError
> -
>
> Key: ARROW-6968
> URL: https://issues.apache.org/jira/browse/ARROW-6968
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.0
> Environment: Python 3.7.4 on macOS Mojave 10.14.6
> Python 3.6.7 on Ubuntu 16.04.6 LTS
> Reporter: Michael Wheeler
> Priority: Major
> Attachments: attribute_error_pyarrow_0_15_0.py
>
> The code in question:
> {code:java}
> """
> Reproduce AttributeError with PyArrow == 0.15.0
> """
> import io
> import logging
> import pandas
> import pyarrow
> import sys
> import textwrap
> logging.basicConfig(level=logging.DEBUG)
> logging.debug(f'Python v{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}')
> logging.debug(f'PyArrow v{pyarrow.__version__}' + '\n')
> CSV_TEXT = textwrap.dedent("""\
>     id,gender,some_date,age
>     001,M,01/01/2019,75
>     002,F,02/02/2018,32
>     003,M,03/03/2017,27
>     004,F,04/04/2016,19
>     005,M,05/05/2015,55
>     006,F,06/06/2014,42
>     """)
> # Initialize pyarrow table via pandas
> mock_file = io.StringIO(CSV_TEXT)
> df = pandas.read_csv(mock_file).sort_values(['age', 'gender'])
> table = pyarrow.Table.from_pandas(df=df)
> # This comprehension generates a map between the name of the column and its index
> map_col_names_to_incides = {item.name: table.columns.index(item) for item in table.columns}
> logging.debug('The column indices are:')
> for name, index in map_col_names_to_incides.items():
>     logging.debug(f'Col {name} -> #{index}')
> {code}
>
> Expected result (generated with 0.14.1):
> {code:java}
> DEBUG:root:Python v3.7.4
> DEBUG:root:PyArrow v0.14.1
> DEBUG:root:The column indices are:
> DEBUG:root:Col id -> #0
> DEBUG:root:Col gender -> #1
> DEBUG:root:Col some_date -> #2
> DEBUG:root:Col age -> #3
> DEBUG:root:Col __index_level_0__ -> #4
> {code}
> Actual result (generated with 0.15.0):
> {code:java}
> DEBUG:root:Python v3.7.4
> DEBUG:root:PyArrow v0.15.0
> Traceback (most recent call last):
>   File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1758, in
>     main()
>   File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1752, in main
>     globals = debugger.run(setup['file'], None, None, is_module)
>   File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1147, in run
>     pydev_imports.execfile(file, globals, locals) # execute the script
>   File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
>     exec(compile(contents+"\n", file, 'exec'), glob, loc)
>   File "/Users/mwheeler/Library/Preferences/PyCharm2019.1/scratches/scratch.py", line 31, in
>     map_col_names_to_incides = {item.name: table.columns.index(item) for item in table.columns}
>   File "/Users/mwheeler/Library/Preferences/PyCharm2019.1/scratches/scratch.py", line 31, in
>     map_col_names_to_incides = {item.name: table.columns.index(item) for item in table.columns}
> AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'name'
> {code}
>
> This error occurs in both of the environments specified above.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
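The two replacements suggested in the comment above build the same name-to-index mapping. Here is a minimal sketch of that equivalence; so that it runs without pyarrow installed, the list `column_names` is a hypothetical stand-in for `table.column_names`, filled with the column names from the report's expected output.

```python
# Hypothetical stand-in for table.column_names (names taken from the report's output).
column_names = ["id", "gender", "some_date", "age", "__index_level_0__"]

# Option 1 from the comment: enumerate the names in order.
name_to_index = {name: i for i, name in enumerate(column_names)}

# Option 2 from the comment: zip names with their positions.
# len(column_names) plays the role of table.num_columns here.
name_to_index_alt = dict(zip(column_names, range(len(column_names))))

for name, index in name_to_index.items():
    print(f"Col {name} -> #{index}")
```

Both forms depend only on the order of the names, so they avoid the per-column `.name` lookup that fails on `ChunkedArray`. One caveat that holds for any dict built this way: if two columns share a name, the later index overwrites the earlier one.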
[jira] [Updated] (ARROW-6973) [C++][ThreadPool] Use perfect forwarding in Submit
[ https://issues.apache.org/jira/browse/ARROW-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-6973: -- Component/s: C++ > [C++][ThreadPool] Use perfect forwarding in Submit > -- > > Key: ARROW-6973 > URL: https://issues.apache.org/jira/browse/ARROW-6973 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Artem Alekseev >Assignee: Artem Alekseev >Priority: Trivial > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6973) [C++][ThreadPool] Use perfect forwarding in Submit
[ https://issues.apache.org/jira/browse/ARROW-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6973: -- Labels: pull-request-available (was: ) > [C++][ThreadPool] Use perfect forwarding in Submit > -- > > Key: ARROW-6973 > URL: https://issues.apache.org/jira/browse/ARROW-6973 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Artem Alekseev >Assignee: Artem Alekseev >Priority: Trivial > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6749) [Python] Conversion of non-ns timestamp array to numpy gives wrong values
[ https://issues.apache.org/jira/browse/ARROW-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche reassigned ARROW-6749:

Assignee: Joris Van den Bossche

> [Python] Conversion of non-ns timestamp array to numpy gives wrong values
> -
>
> Key: ARROW-6749
> URL: https://issues.apache.org/jira/browse/ARROW-6749
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Major
>
> {code}
> In [25]: np_arr = np.arange("2012-01-01", "2012-01-06", int(1e6)*60*60*24, dtype="datetime64[us]")
>
> In [26]: np_arr
> Out[26]:
> array(['2012-01-01T00:00:00.00', '2012-01-02T00:00:00.00',
>        '2012-01-03T00:00:00.00', '2012-01-04T00:00:00.00',
>        '2012-01-05T00:00:00.00'], dtype='datetime64[us]')
>
> In [27]: arr = pa.array(np_arr)
>
> In [28]: arr
> Out[28]:
>
> [
>   2012-01-01 00:00:00.00,
>   2012-01-02 00:00:00.00,
>   2012-01-03 00:00:00.00,
>   2012-01-04 00:00:00.00,
>   2012-01-05 00:00:00.00
> ]
>
> In [29]: arr.type
> Out[29]: TimestampType(timestamp[us])
>
> In [30]: arr.to_numpy()
> Out[30]:
> array(['1970-01-16T08:09:36.0', '1970-01-16T08:11:02.4',
>        '1970-01-16T08:12:28.8', '1970-01-16T08:13:55.2',
>        '1970-01-16T08:15:21.6'], dtype='datetime64[ns]')
> {code}
>
> So it seems to simply interpret the integer microsecond values as nanoseconds
> when converting to numpy.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
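The diagnosis in this issue (microsecond integers read as nanoseconds) can be checked by hand. The sketch below uses only the Python standard library rather than pyarrow or numpy; the helper `from_epoch` is hypothetical and exists purely to make the unit explicit.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch(value: int, unit: str) -> datetime:
    """Interpret an integer offset from the Unix epoch in the given unit ("us" or "ns")."""
    if unit == "us":
        return EPOCH + timedelta(microseconds=value)
    if unit == "ns":
        # timedelta only has microsecond resolution, so drop the sub-microsecond part.
        return EPOCH + timedelta(microseconds=value // 1000)
    raise ValueError(f"unsupported unit: {unit}")


# Integer stored for 2012-01-01 in a timestamp[us] array.
us_value = (datetime(2012, 1, 1, tzinfo=timezone.utc) - EPOCH) // timedelta(microseconds=1)

correct = from_epoch(us_value, "us")        # read with the right unit
wrong = from_epoch(us_value, "ns")          # same integer misread as nanoseconds
fixed = from_epoch(us_value * 1000, "ns")   # rescale by 1000 before relabeling as ns

print(us_value)
print(correct.isoformat())
print(wrong.isoformat())
print(fixed.isoformat())
```

The misread value lands on 1970-01-16T08:09:36, the first entry of `Out[30]` in the report, which supports the conclusion that the integers were relabeled as nanoseconds without being multiplied by 1000.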
[jira] [Created] (ARROW-6973) [C++][ThreadPool] Use perfect forwarding in Submit
Artem Alekseev created ARROW-6973: - Summary: [C++][ThreadPool] Use perfect forwarding in Submit Key: ARROW-6973 URL: https://issues.apache.org/jira/browse/ARROW-6973 Project: Apache Arrow Issue Type: Improvement Reporter: Artem Alekseev Assignee: Artem Alekseev -- This message was sent by Atlassian Jira (v8.3.4#803005)