Re: [Format] Semantics for dictionary batches in streams
Yes, I opened a JIRA. I'm going to try to make a proposal that consolidates all the recent dictionary discussions.

On Mon, Sep 9, 2019 at 12:21 PM Wes McKinney wrote:
> hi Micah,
>
> I think we should formulate changes to format/Columnar.rst and have a vote, what do you think?
>
> On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield wrote:
>>
>>>> I was thinking the file format must satisfy one of two conditions:
>>>> 1. Exactly one DictionaryBatch per encoded column
>>>> 2. DictionaryBatches are interleaved correctly.
>>>
>>> Could you clarify?
>>
>> I think you clarified it very well :) My motivation for suggesting the additional complexity is that I see two use cases for the file format. These roughly correspond with the two options I suggested:
>>
>> 1. We are encoding data from scratch. In this case, it seems like all dictionaries would be built incrementally, not need replacement, and we write them at the end of the file [1]
>>
>> 2. The data being written out is essentially a "tee" off of some stream that is generating new dictionaries requiring replacement on the fly (i.e. reading back two parquet files).
>>
>>> It might be better to disallow replacements in the file format (which does introduce semantic slippage between the file and stream formats as Antoine was saying).
>>
>> It is certainly possible to accept the slippage from the stream format for now and later add this capability, since it should be forward compatible.
>>
>> Thanks,
>> Micah
>>
>> [1] There is also a medium-complexity option where we require one non-delta dictionary and as many delta dictionaries as the user wants.
>>
>> On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney wrote:
>>>
>>> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield wrote:
>>>>
>>>> I was thinking the file format must satisfy one of two conditions:
>>>> 1. Exactly one DictionaryBatch per encoded column
>>>> 2. DictionaryBatches are interleaved correctly.
>>>
>>> Could you clarify? In the first case, there is no issue with dictionary replacements. I'm not sure about the second case -- if a dictionary id appears twice, then you'll see it twice in the file footer. I suppose you could look at the file offsets to determine whether a dictionary batch precedes a particular record batch block (to know which dictionary you should be using), but that's rather complicated to implement. It might be better to disallow replacements in the file format (which does introduce semantic slippage between the file and stream formats as Antoine was saying).
>>>
>>>> On Tuesday, August 27, 2019, Wes McKinney wrote:
>>>>
>>>>> On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou wrote:
>>>>>
>>>>>> On 27/08/2019 at 22:31, Wes McKinney wrote:
>>>>>>
>>>>>>> So the current situation we have right now in C++ is that if we tried to create an IPC stream from a sequence of record batches that don't all have the same dictionary, we'd run into two scenarios:
>>>>>>>
>>>>>>> * Batches that either have a prefix of a prior-observed dictionary, or the prior dictionary is a prefix of their dictionary. For example, suppose that the dictionary sent for an id was ['A', 'B', 'C'] and then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In such a case we could compute and send a delta batch
>>>>>>>
>>>>>>> * Batches with a dictionary that is a permutation of values, and possibly new unique values.
>>>>>>>
>>>>>>> In this latter case, without the option of replacing an existing ID in the stream, we would have to do a unification / permutation of indices and then also possibly send a delta batch.
>>>>>>> We should probably have code at some point that deals with both cases, but in the meantime I would like to allow dictionaries to be redefined in this case. Seems like we might need a vote to formalize this?
>>>>>>
>>>>>> Isn't the stream format deviating from the file format then? In the file format, IIUC, dictionaries can appear after the respective record batches, so there's no way to tell whether the original or redefined version of a dictionary is being referred to.
>>>>>
>>>>> You make a good point -- we can consider changes to the file format to allow for record batches to have different dictionaries. Even handling delta dictionaries with the current file format would be a bit tedious (though not indeterminate)
>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.
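[Editorial aside: the two scenarios discussed above can be sketched in a few lines of Python. This is an illustration only, not any Arrow library's API; the helper name and return convention are made up.]

```python
def plan_dictionary_update(prior, new):
    """Decide how an updated dictionary could be transmitted in an IPC stream.

    Returns a (kind, payload) pair, where kind is one of
    "none", "delta", or "replacement". Hypothetical helper
    illustrating the two cases from the discussion above.
    """
    if new == prior:
        return ("none", [])
    # Case 1: the prior dictionary is a prefix of the new one, so only
    # the trailing new values need to be sent as a delta batch.
    if new[:len(prior)] == prior:
        return ("delta", new[len(prior):])
    # Case 2: the values were permuted (and possibly extended). Without
    # replacement semantics the writer would have to unify dictionaries
    # and re-map indices; with replacement it resends the whole dictionary.
    return ("replacement", new)
```

For example, `plan_dictionary_update(['A', 'B', 'C'], ['A', 'B', 'C', 'D', 'E'])` yields a delta of `['D', 'E']`, while a permuted dictionary like `['B', 'A', 'C']` forces a replacement.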
Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)
Sounds good to me also, and I don't think we need a vote either.

On Sat, Sep 7, 2019 at 7:36 PM Micah Kornfield wrote:
> +1 on this, I also don't think a vote is necessary as long as we make the change before 0.15.0
>
> On Saturday, September 7, 2019, Wes McKinney wrote:
>>
>> I see, thank you for catching this nuance.
>>
>> I agree that using {0xFFFFFFFF, 0x00000000} for EOS will resolve the issue while allowing implementations to be backwards compatible (i.e. handling the 4-byte EOS from older payloads).
>>
>> I'm not sure that we need to have a vote about this, what do others think?
>>
>> On Sat, Sep 7, 2019 at 12:47 AM Ji Liu wrote:
>>>
>>> Hi all,
>>>
>>> During the Java code review [1], there seems to be a problem with the current implementations (C++/Java etc.) when reaching EOS: the new-format EOS is 8 bytes, but the reader only reads 4 bytes at the end of the stream, so the additional 4 bytes are left unread, which causes problems for subsequent reads.
>>>
>>> There are some suggested options [2] below; we should reach consensus and fix this problem before the 0.15 release.
>>> i. For the new format, an 8-byte EOS token should look like {0xFFFFFFFF, 0x00000000}, so we read the continuation token first, and then know to read the next 4 bytes, which are then 0 to signal EOS.
>>> ii. The reader just remembers the state, so if it reads the continuation token at the beginning, it then reads all 8 bytes at the end.
>>>
>>> Thanks,
>>> Ji Liu
>>>
>>> [1] https://github.com/apache/arrow/pull/5229
>>> [2] https://github.com/apache/arrow/pull/5229#discussion_r321715682
>>>
>>> --
>>> From: Eric Erhardt
>>> Sent: Thursday, September 5, 2019 07:16
>>> To: dev@arrow.apache.org; Ji Liu <niki...@aliyun.com>
>>> Cc: emkornfield; Paul Taylor <ptay...@apache.org>
>>> Subject: RE: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)
>>>
>>> The C# PR is up.
>>>
>>> https://github.com/apache/arrow/pull/5280
>>>
>>> Eric
>>>
>>> -----Original Message-----
>>> From: Eric Erhardt
>>> Sent: Wednesday, September 4, 2019 10:12 AM
>>> To: dev@arrow.apache.org; Ji Liu
>>> Cc: emkornfield; Paul Taylor <ptay...@apache.org>
>>> Subject: RE: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)
>>>
>>> I'm working on a PR for the C# bindings. I hope to have it up in the next day or two. Integration tests for C# would be a great addition at some point - it's been on my backlog. For now I plan on manually testing it.
>>>
>>> -----Original Message-----
>>> From: Wes McKinney
>>> Sent: Tuesday, September 3, 2019 10:17 PM
>>> To: Ji Liu
>>> Cc: emkornfield; dev; Paul Taylor
>>> Subject: Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)
>>>
>>> hi folks,
>>>
>>> We now have patches up for Java, JS, and Go. How are we doing on the code reviews for getting these in?
>>>
>>> Since C# implements the binary protocol, the C# developers might want to look at this before the 0.15.0 release also.
>>> Absent integration tests it's difficult to verify the C# library, though.
>>>
>>> Thanks
>>>
>>> On Thu, Aug 29, 2019 at 8:13 AM Ji Liu wrote:
>>>>
>>>> Here is the Java implementation: https://github.com/apache/arrow/pull/5229
>>>>
>>>> cc @Wes McKinney @emkornfield
>>>>
>>>> Thanks,
>>>> Ji Liu
>>>>
>>>> --
>>>> From: Ji Liu
>>>> Sent: Wednesday, August 28, 2019 17:34
>>>> To: emkornfield; dev
>>>> Cc: Paul Taylor
>>>> Subject: Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)
>>>>
>>>> I could take the Java implementation and will take a close watch on this issue in the next few days.
>>>>
>>>> Thanks,
>>>> Ji Liu
>>>>
>>>> --
>>>> From: Micah Kornfield
>>>> Sent: Wednesday, August 28, 2019 17:14
>>>> To: dev
>>>> Cc: Paul Taylor
>>>> Subject: Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)
>>>>
>>>> I should have integration tests with 0.14.1 generated binaries in the next few days. I think the one remaining unassigned piece
[Discuss] [Java] DateMilliVector.getObject() return type (LocalDateTime vs LocalDate)
Yongbo Zhang opened up a pull request [1] to have DateMilliVector return a LocalDate instead of a LocalDateTime object. Do people have opinions on whether this breaking change is worth the correctness improvement?

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/5315

On Sat, Sep 7, 2019 at 4:14 PM Yongbo Zhang wrote:
> Summary: [Java] DateMilliVector.getObject() should return a LocalDate, not a LocalDateTime
> Key: ARROW-1984
> URL: https://issues.apache.org/jira/browse/ARROW-1984
> Pull Request: https://github.com/apache/arrow/pull/5315
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Reporter: Vanco Buca
> Assignee: Yongbo Zhang
> Fix For: 0.15.0
>
> This is an API breaking change, therefore we may want to discuss it before merging any PRs in.
[jira] [Created] (ARROW-6504) [Python][Packaging] Add mimalloc to Windows conda packages for better performance
Wes McKinney created ARROW-6504:
---
Summary: [Python][Packaging] Add mimalloc to Windows conda packages for better performance
Key: ARROW-6504
URL: https://issues.apache.org/jira/browse/ARROW-6504
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Wes McKinney
Fix For: 0.15.0

--
This message was sent by Atlassian Jira (v8.3.2#803003)
Re: Plasma scenarios
If we build the GLib-based library with MSVC, it doesn't require MSYS nor Cygwin. It just requires MSVC.

In "RE: Plasma scenarios" on Mon, 9 Sep 2019 22:05:26 +0000, Eric Erhardt wrote:
> I don't think the C# bindings would use the GLib-based libraries on Windows if it requires installing MSYS2 or Cygwin on the end-user's Windows machine. So don't go through the work of building the GLib-based libraries with MSVC on account of the C# library.
>
> -----Original Message-----
> From: Sutou Kouhei
> Sent: Monday, September 9, 2019 4:43 PM
> To: dev@arrow.apache.org
> Subject: Re: Plasma scenarios
>
> Hi,
>
>> In theory you could use the GLib-based library with MSVC, the main
>> requirement is gobject-introspection
>>
>> https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst
>
> Generally, we can use the GLib-based library without GObject Introspection if we write bindings by hand. (We can generate bindings automatically with GObject Introspection.)
>
> But we need to do some work to build the GLib-based library with MSVC. I'll work on it in a few months.
>
> Thanks,
> --
> kou
>
> In "Re: Plasma scenarios" on Mon, 9 Sep 2019 12:00:00 -0500, Wes McKinney wrote:
>> hi Eric,
>>
>> On Fri, Sep 6, 2019 at 5:09 PM Eric Erhardt wrote:
>>>
>>> I was looking for the high level scenarios for the Plasma In-Memory Object Store. A colleague of mine suggested we could use it to pass data between a C# process and a Python process.
>>>
>>> I've read the intro blog [0] on Plasma, which describes using the same data set from multiple processes - which sounds like the same scenario as above.
>>>
>>> I am trying to prioritize creating C# bindings for the Plasma client. So I'd like to know all the scenarios that could be enabled with Plasma.
>>>
>>> For example:
>>> - could using Plasma speed up Pandas UDFs in PySpark? Because the data wouldn't have to go across the socket between Java and Python, but instead would be memory-mapped. We have similar functionality in .NET for Apache Spark.
>>
>> Memory still would need to be copied into the memory-mappable file, so it's unclear whether this would be faster than passing the data through a socket as it's being done now.
>>
>>> - Is Plasma being used by Nvidia RAPIDS?
>>
>> AFAIK it is not. It doesn't seem out of the question, though, given that we have some level of CUDA support in Plasma now.
>>
>>> I know Plasma today is not supported on Windows, but I think support could be added since Windows supports memory mapped files (through a different API than mmap) and it now supports Unix Domain Sockets [1].
>>>
>>> Also - side question about the c_glib bindings. I assume those will only ever work on Windows with something like Cygwin or MSYS2, right? Would people be opposed to adding pure "C" exports to the plasma library so the C# bindings could use it? (similar to the JNI support today).
>>
>> In theory you could use the GLib-based library with MSVC, the main requirement is gobject-introspection
>>
>> https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst
>>
>> Note that GLib itself is LGPL-licensed -- since it is an optional component in Apache Arrow, it is OK for optional components to have an LGPL dependency (though ASF projects aren't allowed to have mandatory/hard dependencies on LGPL). So if you do go that route just beware the possible issues you might have down the road.
>>
>> I have no objection to adding a "plasma/plasma-c.h" with C exports.
>>
>>> Eric
>>>
>>> [0] https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html
>>> [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/
[jira] [Created] (ARROW-6503) [C++] Add an argument of memory pool object to SparseTensorConverter
Kenta Murata created ARROW-6503:
---
Summary: [C++] Add an argument of memory pool object to SparseTensorConverter
Key: ARROW-6503
URL: https://issues.apache.org/jira/browse/ARROW-6503
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata

According to the comment https://github.com/apache/arrow/pull/5290#discussion_r322244745, we need variants of some functions that supply a memory pool object to the SparseTensorConverter function.
[jira] [Created] (ARROW-6502) [GLib][CI] MinGW failure in CI
Wes McKinney created ARROW-6502:
---
Summary: [GLib][CI] MinGW failure in CI
Key: ARROW-6502
URL: https://issues.apache.org/jira/browse/ARROW-6502
Project: Apache Arrow
Issue Type: Bug
Components: GLib
Reporter: Wes McKinney
Fix For: 0.15.0

This failure seems to have crept into master: https://ci.appveyor.com/project/wesm/arrow/build/job/ocfkn9m0a3ux1ur5#L2288
[jira] [Created] (ARROW-6501) [Format][C++] Remove non_zero_length field from SparseIndex
Kenta Murata created ARROW-6501:
---
Summary: [Format][C++] Remove non_zero_length field from SparseIndex
Key: ARROW-6501
URL: https://issues.apache.org/jira/browse/ARROW-6501
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Format
Reporter: Kenta Murata
Assignee: Kenta Murata

We can remove the non_zero_length field from SparseIndex because it can be derived from the shape of the indices tensor.
Re: Can the R interface to write_parquet accept strings?
I'm referring to the arrow-devel and parquet-devel packages, which are C++ packages. If you built the R library (using install.packages()) against version 0.14.0 and then upgraded arrow-devel to 0.14.1 without rebuilding the R library, you could have this issue. I would recommend reinstalling the R package to see if the problem goes away.

On Mon, Sep 9, 2019, 6:34 PM Daniel Feenberg wrote:
>
> On Mon, 9 Sep 2019, Wes McKinney wrote:
>
>> I'm a bit confused by the error message
>>
>> "
>> Error in write_parquet_file(to_arrow(table), file) :
>>   Arrow error: IOError: Metadata contains Thrift LogicalType that is not recognized.
>> "
>>
>> This error comes from
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L455
>>
>> This function should not be called at all during the execution of "write_parquet_file".
>>
>> Daniel, is it possible you changed the C++ library installed after building the "arrow" R package? The R package must generally be recompiled when the C++ library is upgraded
>
> We are not aware of changing anything in C++. It is just as yum left it. We didn't compile the R arrow package at all, just used what yum supplied from the distribution. Are you suggesting we compile the R package ourselves, that the Scientific Linux distribution packages are inconsistent? Note that the default C++ is rather old and it would be a problem to update it, since so many other packages depend on it. But we could update Arrow, I suppose.
>
> Daniel Feenberg
Re: Can the R interface to write_parquet accept strings?
On Mon, 9 Sep 2019, Wes McKinney wrote:

> I'm a bit confused by the error message
>
> "
> Error in write_parquet_file(to_arrow(table), file) :
>   Arrow error: IOError: Metadata contains Thrift LogicalType that is not recognized.
> "
>
> This error comes from
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L455
>
> This function should not be called at all during the execution of "write_parquet_file".
>
> Daniel, is it possible you changed the C++ library installed after building the "arrow" R package? The R package must generally be recompiled when the C++ library is upgraded

We are not aware of changing anything in C++. It is just as yum left it. We didn't compile the R arrow package at all, just used what yum supplied from the distribution. Are you suggesting we compile the R package ourselves, that the Scientific Linux distribution packages are inconsistent? Note that the default C++ is rather old and it would be a problem to update it, since so many other packages depend on it. But we could update Arrow, I suppose.

Daniel Feenberg
Re: [Discuss][FlightRPC] Extensions to Flight: middleware and DoPut tickets
I'm happy to start a new thread to focus on DoPut specifically. Middleware for Java has been in review. Best, David On 9/9/19, Wes McKinney wrote: > Ah, I think I'm referring to the format change around DoPut, for which > there is not a PR yet. Sorry for my confusion > > Do we want to start a separate discussion thread about that? > > https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing > > On Mon, Sep 9, 2019 at 3:51 PM Antoine Pitrou wrote: >> >> >> Isn't a middleware an implementation-specific concern? Does it need a >> formal vote? >> >> Regards >> >> Antoine. >> >> >> Le 09/09/2019 à 22:49, Wes McKinney a écrit : >> > It seems like there is positive feedback on the PR. Do we want to have >> > a vote about this? >> > >> > On Mon, Aug 12, 2019 at 7:54 AM David Li wrote: >> >> >> >> I've (finally) put up a draft implementation of middleware for Java: >> >> https://github.com/apache/arrow/pull/5068 >> >> >> >> Hopefully this helps clarify how the proposal works. >> >> >> >> Best, >> >> David >> >> >> >> On 7/25/19, David Li wrote: >> >>> Thanks for the feedback, Antoine. That would be a natural method to >> >>> have - then the server could deny uploads (as you mention) or note >> >>> that the stream already exists. I've updated the proposal to reflect >> >>> that, leaving more detailed semantics (e.g. append vs overwrite) >> >>> application-defined. >> >>> >> >>> Best, >> >>> David >> >>> >> >>> On 7/25/19, Antoine Pitrou wrote: >> >> Le 08/07/2019 à 16:33, David Li a écrit : >> > Hi all, >> > >> > I've put together two more proposals for Flight, motivated by >> > projects >> > we've been working on. I'd appreciate any comments on the >> > design/reasoning; I'm already working on the implementation, >> > alongside >> > some other improvements to Flight. >> > >> > The first is to modify the DoPut call to follow the same request >> > pattern as DoGet. This is a format change and would require a vote. 
>> > >> > https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing >> >> It seems it would be useful to introduce a GetPutInfo (or >> GetUploadInfo) >> so as to allow differential behaviour between getting and putting. >> >> (one trivial case would be to disallow uploading altogether :-))) >> >> Regards >> >> Antoine. >> >> >>> >
[jira] [Created] (ARROW-6500) [Java] How to use RootAllocator in a low memory setting?
Andong Zhan created ARROW-6500:
--
Summary: [Java] How to use RootAllocator in a low memory setting?
Key: ARROW-6500
URL: https://issues.apache.org/jira/browse/ARROW-6500
Project: Apache Arrow
Issue Type: Bug
Components: Java
Affects Versions: 0.13.0
Reporter: Andong Zhan

When I run this simple code with the JVM setting "-Xmx64m"

{code:java}
package com.snowflake;

import org.apache.arrow.memory.RootAllocator;

public class TestArrow {
  public static void main(String args[]) throws Exception {
    new RootAllocator(Integer.MAX_VALUE);
  }
}
{code}

I get the following error:

{code:java}
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStore=/etc/pki/ca-trust/extracted/java/cacerts
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStore=/etc/pki/ca-trust/extracted/java/cacerts
Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.arrow.memory.BaseAllocator.createEmpty(BaseAllocator.java:263)
    at org.apache.arrow.memory.BaseAllocator.<init>(BaseAllocator.java:89)
    at org.apache.arrow.memory.RootAllocator.<init>(RootAllocator.java:34)
    at org.apache.arrow.memory.RootAllocator.<init>(RootAllocator.java:30)
    at com.snowflake.TestArrow.main(TestArrow.java:13)
Caused by: java.lang.NullPointerException
    at io.netty.buffer.PooledByteBufAllocatorL$InnerAllocator.<init>(PooledByteBufAllocatorL.java:145)
    at io.netty.buffer.PooledByteBufAllocatorL.<init>(PooledByteBufAllocatorL.java:49)
    at org.apache.arrow.memory.AllocationManager.<clinit>(AllocationManager.java:61)
    ... 5 more

Process finished with exit code 1
{code}

So how do we use RootAllocator in such a low-memory case?
RE: Plasma scenarios
I don't think the C# bindings would use the GLib-based libraries on Windows if it requires installing MSYS2 or Cygwin on the end-user's Windows machine. So don't go through the work of building the GLib-based libraries with MSVC on account of the C# library.

-----Original Message-----
From: Sutou Kouhei
Sent: Monday, September 9, 2019 4:43 PM
To: dev@arrow.apache.org
Subject: Re: Plasma scenarios

Hi,

> In theory you could use the GLib-based library with MSVC, the main
> requirement is gobject-introspection
>
> https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst

Generally, we can use the GLib-based library without GObject Introspection if we write bindings by hand. (We can generate bindings automatically with GObject Introspection.)

But we need to do some work to build the GLib-based library with MSVC. I'll work on it in a few months.

Thanks,
--
kou

In "Re: Plasma scenarios" on Mon, 9 Sep 2019 12:00:00 -0500, Wes McKinney wrote:
> hi Eric,
>
> On Fri, Sep 6, 2019 at 5:09 PM Eric Erhardt wrote:
>>
>> I was looking for the high level scenarios for the Plasma In-Memory Object Store. A colleague of mine suggested we could use it to pass data between a C# process and a Python process.
>>
>> I've read the intro blog [0] on Plasma, which describes using the same data set from multiple processes - which sounds like the same scenario as above.
>>
>> I am trying to prioritize creating C# bindings for the Plasma client. So I'd like to know all the scenarios that could be enabled with Plasma.
>>
>> For example:
>> - could using Plasma speed up Pandas UDFs in PySpark?
Because the data wouldn't have to go across the socket between Java and Python, but instead would be memory-mapped. We have similar functionality in .NET for Apache Spark.
>
> Memory still would need to be copied into the memory-mappable file, so it's unclear whether this would be faster than passing the data through a socket as it's being done now.
>
>> - Is Plasma being used by Nvidia RAPIDS?
>
> AFAIK it is not. It doesn't seem out of the question, though, given that we have some level of CUDA support in Plasma now.
>
>> I know Plasma today is not supported on Windows, but I think support could be added since Windows supports memory mapped files (through a different API than mmap) and it now supports Unix Domain Sockets [1].
>>
>> Also - side question about the c_glib bindings. I assume those will only ever work on Windows with something like Cygwin or MSYS2, right? Would people be opposed to adding pure "C" exports to the plasma library so the C# bindings could use it? (similar to the JNI support today).
>
> In theory you could use the GLib-based library with MSVC, the main requirement is gobject-introspection
>
> https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst
>
> Note that GLib itself is LGPL-licensed -- since it is an optional component in Apache Arrow, it is OK for optional components to have an LGPL dependency (though ASF projects aren't allowed to have mandatory/hard dependencies on LGPL). So if you do go that route just beware the possible issues you might have down the road.
>
> I have no objection to adding a "plasma/plasma-c.h" with C exports.
>
>> Eric
>>
>> [0] https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html
>> [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/
[jira] [Created] (ARROW-6499) [C++] Add support for bundled Boost with MSVC
Sutou Kouhei created ARROW-6499:
---
Summary: [C++] Add support for bundled Boost with MSVC
Key: ARROW-6499
URL: https://issues.apache.org/jira/browse/ARROW-6499
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
Re: Plasma scenarios
Hi,

> In theory you could use the GLib-based library with MSVC, the main
> requirement is gobject-introspection
>
> https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst

Generally, we can use the GLib-based library without GObject Introspection if we write bindings by hand. (We can generate bindings automatically with GObject Introspection.)

But we need to do some work to build the GLib-based library with MSVC. I'll work on it in a few months.

Thanks,
--
kou

In "Re: Plasma scenarios" on Mon, 9 Sep 2019 12:00:00 -0500, Wes McKinney wrote:
> hi Eric,
>
> On Fri, Sep 6, 2019 at 5:09 PM Eric Erhardt wrote:
>>
>> I was looking for the high level scenarios for the Plasma In-Memory Object Store. A colleague of mine suggested we could use it to pass data between a C# process and a Python process.
>>
>> I've read the intro blog [0] on Plasma, which describes using the same data set from multiple processes - which sounds like the same scenario as above.
>>
>> I am trying to prioritize creating C# bindings for the Plasma client. So I'd like to know all the scenarios that could be enabled with Plasma.
>>
>> For example:
>> - could using Plasma speed up Pandas UDFs in PySpark? Because the data wouldn't have to go across the socket between Java and Python, but instead would be memory-mapped. We have similar functionality in .NET for Apache Spark.
>
> Memory still would need to be copied into the memory-mappable file, so it's unclear whether this would be faster than passing the data through a socket as it's being done now.
>
>> - Is Plasma being used by Nvidia RAPIDS?
>
> AFAIK it is not. It doesn't seem out of the question, though, given that we have some level of CUDA support in Plasma now.
>
>> I know Plasma today is not supported on Windows, but I think support could be added since Windows supports memory mapped files (through a different API than mmap) and it now supports Unix Domain Sockets [1].
>> >> Also - side question about the c_glib bindings. I assume those will only >> ever work on Windows with something like Cygwin or MSYS2, right? Would >> people be opposed to adding pure "C" exports to the plasma library so the C# >> bindings could use it? (similar to the JNI support today). >> > > In theory you could use the GLib-based library with MSVC, the main > requirement is gobject-introspection > > https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst > > Note that GLib itself is LGPL-licensed -- since it is an optional > component in Apache Arrow, it is OK for optional components to have an > LGPL dependency (though ASF projects aren't allowed to have > mandatory/hard dependencies on LGPL). So if you do go that route just > beware the possible issues you might have down the road. > > I have no objection to adding a "plasma/plasma-c.h" with C exports. > >> Eric >> >> [0] >> https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html >> [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/
Re: Can the R interface to write_parquet accept strings?
I'm a bit confused by the error message " Error in write_parquet_file(to_arrow(table), file) : Arrow error: IOError: Metadata contains Thrift LogicalType that is not recognized. " This error comes from https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L455 This function should not be called at all during the execution of "write_parquet_file". Daniel, is it possible you changed the C++ library installed after building the "arrow" R package? The R package must generally be recompiled when the C++ library is upgraded On Mon, Sep 9, 2019 at 4:29 PM Daniel Feenberg wrote: > > > > On Mon, 9 Sep 2019, Neal Richardson wrote: > > > Hi Daniel, > > This works on my machine: > > > >> library(arrow) > >> write_parquet(data.frame(y = c("a", "b", "c"), stringsAsFactors=FALSE), > >> file= "string.parquet") > >> read_parquet("string.parquet") > > y > > 1 a > > 2 b > > 3 c > >> > > > > (The function masking warnings are all from library(tidyverse) and > > aren't relevant here.) > > > > What OS are you on, and how did you install the arrow package? I'm on > > macOS and installed arrow from CRAN, but if that's not the case for > > you, then your C++ library may have different capabilities. > > Here are the details of our installation: > > 1) OS: > -- > Scientific Linux 7 > uname: Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26 12:36:06 CST 2018 > x86_64 x86_64 x86_64 GNU/Linux > > 2) gcc version: > > # gcc --version > gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36) > > > 3) arrow and parquet library installation: > -- > yum install arrow-devel parquet-devel > > versions: > arrow-devel: yum info arrow-devel > Installed Packages > Name: arrow-devel > Arch: x86_64 > Version : 0.14.1 > Release : 1.el7 > Size: 20 M > Repo: installed > From repo : apache-arrow > Summary : Libraries and header files for Apache Arrow C++ > URL : https://arrow.apache.org/ > License : Apache-2.0 > Description : Libraries and header files for Apache Arrow C++. 
> > yum info parquet-devel > Installed Packages > Name: parquet-devel > Arch: x86_64 > Version : 0.14.1 > Release : 1.el7 > Size: 6.4 M > Repo: installed > From repo : apache-arrow > Summary : Libraries and header files for Apache Parquet C++ > URL : https://arrow.apache.org/ > License : Apache-2.0 > Description : Libraries and header files for Apache Parquet C++. > > > 4) R arrow installation: > -- > install.packages("arrow") > > and also > > install.packages("sparklyr") > > Thanks for taking an interest. > > Daniel Feenberg > > >
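The "Thrift LogicalType that is not recognized" error Wes points at is a forward-compatibility check: the installed parquet reader rejects logical-type metadata newer than it understands, which is why a version mismatch between the R package and the C++ library produces it. The following is only a toy Python sketch of that shape of check; the type list and function name are invented, not parquet-cpp's actual code:

```python
# Toy model of a reader's forward-compatibility check on file metadata.
# KNOWN_LOGICAL_TYPES is an illustrative subset, not parquet-cpp's real set.
KNOWN_LOGICAL_TYPES = {"STRING", "INT", "DECIMAL", "DATE", "TIME", "TIMESTAMP"}

def check_metadata(logical_types):
    """Raise if the file metadata carries a logical type this reader predates."""
    for t in logical_types:
        if t not in KNOWN_LOGICAL_TYPES:
            raise IOError(
                f"Metadata contains Thrift LogicalType that is not recognized: {t}")

check_metadata(["STRING", "INT"])   # fine with this reader
# check_metadata(["UUID"])          # would raise with this (older) reader
```

Upgrading the reader (or recompiling the R package against the same C++ library that wrote the file) is what resolves the mismatch.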
Re: Plasma scenarios
Hi, > I know Plasma today is not supported on Windows, but I think support could be > added since Windows supports memory mapped files (through a different API > than mmap) and it now supports Unix Domain Sockets [1]. > ... > [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/ Thanks for the information. I read the document. It seems that Unix domain socket on Windows doesn't support file descriptor passing: > Ancillary data: Linux‘s unix socket implementation supports passing ancillary > data such as passing file descriptors Plasma uses this feature: https://github.com/apache/arrow/blob/master/cpp/src/plasma/fling.cc#L33 https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L940 Thanks, -- kou In "Plasma scenarios" on Fri, 6 Sep 2019 22:09:38 +, Eric Erhardt wrote: > I was looking for the high level scenarios for the Plasma In-Memory Object > Store. A colleague of mine suggested we could use it to pass data between a > C# process and a Python process. > > I've read the intro blog [0] on Plasma, which describes using the same data > set from multiple processes - which sounds like the same scenario as above. > > I am trying to prioritize creating C# bindings for the Plasma client. So I'd > like to know all the scenarios that would could be enabled with Plasma. > > For example: > - could using Plasma speed up Pandas UDFs in PySpark? Because the data > wouldn't have to go across the socket between Java and Python, but instead > would be memory-mapped. We have similar functionality in .NET for Apache > Spark. > - Is Plasma being used by Nvidia RAPIDS? > > I know Plasma today is not supported on Windows, but I think support could be > added since Windows supports memory mapped files (through a different API > than mmap) and it now supports Unix Domain Sockets [1]. > > Also - side question about the c_glib bindings. I assume those will only ever > work on Windows with something like Cygwin or MSYS2, right? 
Would people be > opposed to adding pure "C" exports to the plasma library so the C# bindings > could use it? (similar to the JNI support today). > > Eric > > [0] > https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html > [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/
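kou's point about fling.cc can be demonstrated with the Python stdlib alone: passing an open file descriptor as SCM_RIGHTS ancillary data over a Unix domain socket, which is exactly the feature the Windows AF_UNIX implementation lacks. A minimal sketch (Python 3.9+, Unix only; this illustrates the OS mechanism, not Plasma's client code):

```python
# File-descriptor passing over a Unix domain socket (SCM_RIGHTS),
# the mechanism Plasma's fling.cc relies on. Unix-only, Python 3.9+.
import os
import socket
import tempfile

parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Server" side: create a memory-mappable file and send its descriptor.
f = tempfile.TemporaryFile()
f.write(b"shared payload")
f.flush()
socket.send_fds(parent, [b"x"], [f.fileno()])  # SCM_RIGHTS under the hood

# "Client" side: receives a duplicated descriptor for the same open file.
data, fds, _, _ = socket.recv_fds(child, 1, 1)
received = os.fdopen(fds[0], "rb")
received.seek(0)  # the dup shares the writer's file offset
print(received.read())  # b'shared payload'
```

Without this, a Windows port would need another way to hand the shared memory segment to clients (e.g. named file mappings opened by name).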
Re: Can the R interface to write_parquet accept strings?
On Mon, 9 Sep 2019, Neal Richardson wrote: Hi Daniel, This works on my machine: library(arrow) write_parquet(data.frame(y = c("a", "b", "c"), stringsAsFactors=FALSE), file= "string.parquet") read_parquet("string.parquet") y 1 a 2 b 3 c (The function masking warnings are all from library(tidyverse) and aren't relevant here.) What OS are you on, and how did you install the arrow package? I'm on macOS and installed arrow from CRAN, but if that's not the case for you, then your C++ library may have different capabilities. Here are the details of our installation: 1) OS: -- Scientific Linux 7 uname: Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26 12:36:06 CST 2018 x86_64 x86_64 x86_64 GNU/Linux 2) gcc version: # gcc --version gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36) 3) arrow and parquet library installation: -- yum install arrow-devel parquet-devel versions: arrow-devel: yum info arrow-devel Installed Packages Name: arrow-devel Arch: x86_64 Version : 0.14.1 Release : 1.el7 Size: 20 M Repo: installed From repo : apache-arrow Summary : Libraries and header files for Apache Arrow C++ URL : https://arrow.apache.org/ License : Apache-2.0 Description : Libraries and header files for Apache Arrow C++. yum info parquet-devel Installed Packages Name: parquet-devel Arch: x86_64 Version : 0.14.1 Release : 1.el7 Size: 6.4 M Repo: installed From repo : apache-arrow Summary : Libraries and header files for Apache Parquet C++ URL : https://arrow.apache.org/ License : Apache-2.0 Description : Libraries and header files for Apache Parquet C++. 4) R arrow installation: -- install.packages("arrow") and also install.packages("sparklyr") Thanks for taking an interest. Daniel Feenberg
Re: [Discuss][FlightRPC] Extensions to Flight: middleware and DoPut tickets
Ah, I think I'm referring to the format change around DoPut, for which there is not a PR yet. Sorry for my confusion. Do we want to start a separate discussion thread about that? https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing On Mon, Sep 9, 2019 at 3:51 PM Antoine Pitrou wrote: > > > Isn't a middleware an implementation-specific concern? Does it need a > formal vote? > > Regards > > Antoine. > > > Le 09/09/2019 à 22:49, Wes McKinney a écrit : > > It seems like there is positive feedback on the PR. Do we want to have > > a vote about this? > > > > On Mon, Aug 12, 2019 at 7:54 AM David Li wrote: > >> > >> I've (finally) put up a draft implementation of middleware for Java: > >> https://github.com/apache/arrow/pull/5068 > >> > >> Hopefully this helps clarify how the proposal works. > >> > >> Best, > >> David > >> > >> On 7/25/19, David Li wrote: > >>> Thanks for the feedback, Antoine. That would be a natural method to > >>> have - then the server could deny uploads (as you mention) or note > >>> that the stream already exists. I've updated the proposal to reflect > >>> that, leaving more detailed semantics (e.g. append vs overwrite) > >>> application-defined. > >>> > >>> Best, > >>> David > >>> > >>> On 7/25/19, Antoine Pitrou wrote: > > Le 08/07/2019 à 16:33, David Li a écrit : > > Hi all, > > > > I've put together two more proposals for Flight, motivated by projects > > we've been working on. I'd appreciate any comments on the > > design/reasoning; I'm already working on the implementation, alongside > > some other improvements to Flight. > > > > The first is to modify the DoPut call to follow the same request > > pattern as DoGet. This is a format change and would require a vote.
> > > > https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing > > It seems it would be useful to introduce a GetPutInfo (or GetUploadInfo) > so as to allow differential behaviour between getting and putting. > > (one trivial case would be to disallow uploading altogether :-))) > > Regards > > Antoine. > > >>>
Re: [Discuss][FlightRPC] Extensions to Flight: middleware and DoPut tickets
Isn't a middleware an implementation-specific concern? Does it need a formal vote? Regards Antoine. Le 09/09/2019 à 22:49, Wes McKinney a écrit : > It seems like there is positive feedback on the PR. Do we want to have > a vote about this? > > On Mon, Aug 12, 2019 at 7:54 AM David Li wrote: >> >> I've (finally) put up a draft implementation of middleware for Java: >> https://github.com/apache/arrow/pull/5068 >> >> Hopefully this helps clarify how the proposal works. >> >> Best, >> David >> >> On 7/25/19, David Li wrote: >>> Thanks for the feedback, Antoine. That would be a natural method to >>> have - then the server could deny uploads (as you mention) or note >>> that the stream already exists. I've updated the proposal to reflect >>> that, leaving more detailed semantics (e.g. append vs overwrite) >>> application-defined. >>> >>> Best, >>> David >>> >>> On 7/25/19, Antoine Pitrou wrote: Le 08/07/2019 à 16:33, David Li a écrit : > Hi all, > > I've put together two more proposals for Flight, motivated by projects > we've been working on. I'd appreciate any comments on the > design/reasoning; I'm already working on the implementation, alongside > some other improvements to Flight. > > The first is to modify the DoPut call to follow the same request > pattern as DoGet. This is a format change and would require a vote. > > https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing It seems it would be useful to introduce a GetPutInfo (or GetUploadInfo) so as to allow differential behaviour between getting and putting. (one trivial case would be to disallow uploading altogether :-))) Regards Antoine. >>>
Re: [Discuss][FlightRPC] Extensions to Flight: middleware and DoPut tickets
It seems like there is positive feedback on the PR. Do we want to have a vote about this? On Mon, Aug 12, 2019 at 7:54 AM David Li wrote: > > I've (finally) put up a draft implementation of middleware for Java: > https://github.com/apache/arrow/pull/5068 > > Hopefully this helps clarify how the proposal works. > > Best, > David > > On 7/25/19, David Li wrote: > > Thanks for the feedback, Antoine. That would be a natural method to > > have - then the server could deny uploads (as you mention) or note > > that the stream already exists. I've updated the proposal to reflect > > that, leaving more detailed semantics (e.g. append vs overwrite) > > application-defined. > > > > Best, > > David > > > > On 7/25/19, Antoine Pitrou wrote: > >> > >> Le 08/07/2019 à 16:33, David Li a écrit : > >>> Hi all, > >>> > >>> I've put together two more proposals for Flight, motivated by projects > >>> we've been working on. I'd appreciate any comments on the > >>> design/reasoning; I'm already working on the implementation, alongside > >>> some other improvements to Flight. > >>> > >>> The first is to modify the DoPut call to follow the same request > >>> pattern as DoGet. This is a format change and would require a vote. > >>> > >>> https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing > >> > >> It seems it would be useful to introduce a GetPutInfo (or GetUploadInfo) > >> so as to allow differential behaviour between getting and putting. > >> > >> (one trivial case would be to disallow uploading altogether :-))) > >> > >> Regards > >> > >> Antoine. > >> > >
[jira] [Created] (ARROW-6498) [C++][CI] Download googletest tarball and use for EP build to avoid occasional flakiness
Wes McKinney created ARROW-6498: --- Summary: [C++][CI] Download googletest tarball and use for EP build to avoid occasional flakiness Key: ARROW-6498 URL: https://issues.apache.org/jira/browse/ARROW-6498 Project: Apache Arrow Issue Type: Improvement Components: C++, Continuous Integration Reporter: Wes McKinney Failures such as https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27281370/job/dn0ji349v8popkd9 seem to be happening a fair amount. We might try to avoid this by wget-ing a tarball and setting {{$ARROW_GTEST_URL}}. Open to other ideas about how to reduce flakiness -- This message was sent by Atlassian Jira (v8.3.2#803003)
Re: [Format] Semantics for dictionary batches in streams
hi Micah, I think we should formulate changes to format/Columnar.rst and have a vote, what do you think? On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield wrote: >> >> >> > I was thinking the file format must satisfy one of two conditions: >> > 1. Exactly one dictionarybatch per encoded column >> > 2. DictionaryBatches are interleaved correctly. >> >> Could you clarify? > > I think you clarified it very well :) My motivation for suggesting the > additional complexity is I see two use-cases for the file format. These > roughly correspond with the two options I suggested: > 1. We are encoding data from scratch. In this case, it seems like all > dictionaries would be built incrementally, not need replacement and we write > them at the end of the file [1] > > 2. The data being written out is essentially a "tee" off of some stream that > is generating new dictionaries requiring replacement on the fly (i.e. reading > back two parquet files). > >> It might be better to disallow replacements >> in the file format (which does introduce semantic slippage between the >> file and stream formats as Antoine was saying). > > It is is certainly possible, to accept the slippage from the stream format > for now and later add this capability, since it should be forwards compatible. > > Thanks, > Micah > > [1] There is also medium complexity option where we require one non-delta > dictionary and as many delta dictionaries as the user want. > > On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney wrote: >> >> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield >> wrote: >> > >> > I was thinking the file format must satisfy one of two conditions: >> > 1. Exactly one dictionarybatch per encoded column >> > 2. DictionaryBatches are interleaved correctly. >> >> Could you clarify? In the first case, there is no issue with >> dictionary replacements. I'm not sure about the second case -- if a >> dictionary id appears twice, then you'll see it twice in the file >> footer. 
I suppose you could look at the file offsets to determine >> whether a dictionary batch precedes a particular record batch block >> (to know which dictionary you should be using), but that's rather >> complicated to implement. It might be better to disallow replacements >> in the file format (which does introduce semantic slippage between the >> file and stream formats as Antoine was saying). >> >> > >> > On Tuesday, August 27, 2019, Wes McKinney wrote: >> > >> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou >> > > wrote: >> > > > >> > > > >> > > > Le 27/08/2019 à 22:31, Wes McKinney a écrit : >> > > > > So the current situation we have right now in C++ is that if we tried >> > > > > to create an IPC stream from a sequence of record batches that don't >> > > > > all have the same dictionary, we'd run into two scenarios: >> > > > > >> > > > > * Batches that either have a prefix of a prior-observed dictionary, >> > > > > or >> > > > > the prior dictionary is a prefix of their dictionary. For example, >> > > > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and >> > > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In >> > > > > such case we could compute and send a delta batch >> > > > > >> > > > > * Batches with a dictionary that is a permutation of values, and >> > > > > possibly new unique values. >> > > > > >> > > > > In this latter case, without the option of replacing an existing ID >> > > > > in >> > > > > the stream, we would have to do a unification / permutation of >> > > > > indices >> > > > > and then also possibly send a delta batch. We should probably have >> > > > > code at some point that deals with both cases, but in the meantime I >> > > > > would like to allow dictionaries to be redefined in this case. Seems >> > > > > like we might need a vote to formalize this? >> > > > >> > > > Isn't the stream format deviating from the file format then? 
In the >> > > > file format, IIUC, dictionaries can appear after the respective record >> > > > batches, so there's no way to tell whether the original or redefined >> > > > version of a dictionary is being referred to. >> > > >> > > You make a good point -- we can consider changes to the file format to >> > > allow for record batches to have different dictionaries. Even handling >> > > delta dictionaries with the current file format would be a bit tedious >> > > (though not indeterminate) >> > > >> > > > Regards >> > > > >> > > > Antoine. >> > >
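The two cases Wes describes — a prior dictionary that is a prefix of the new one (delta batch) versus a permutation or reshuffle (replacement) — can be sketched in a few lines. The function name and tuple encoding here are invented for illustration; this mirrors the semantics under discussion, not Arrow's actual C++ API:

```python
# Decide how a new dictionary for an already-sent id can be shipped:
# - old is a prefix of new  -> send only the appended values as a delta batch
# - otherwise               -> a replacement (or index unification) is needed
def plan_dictionary_update(old, new):
    if new[:len(old)] == old:
        delta = new[len(old):]
        return ("delta", delta) if delta else ("unchanged", [])
    return ("replacement", new)

print(plan_dictionary_update(["A", "B", "C"], ["A", "B", "C", "D", "E"]))
# ('delta', ['D', 'E'])
print(plan_dictionary_update(["A", "B", "C"], ["B", "A", "C"]))
# ('replacement', ['B', 'A', 'C'])
```

The file-format question in the thread is precisely whether the "replacement" branch should be allowed at all, since the footer does not order dictionary batches relative to record batches.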
[jira] [Created] (ARROW-6497) [Website] On change to master branch, automatically make PR to asf-site
Neal Richardson created ARROW-6497: -- Summary: [Website] On change to master branch, automatically make PR to asf-site Key: ARROW-6497 URL: https://issues.apache.org/jira/browse/ARROW-6497 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Neal Richardson Assignee: Neal Richardson I added a build/deploy script to arrow-site that would enable automatically publishing to asf-site when there is a commit to the master branch. However, ASF won't let us add a deploy key to enable this publishing (INFRA-18924). I have a workaround that's not automatic but as close as we can get. On commits to apache/arrow-site's master branch, Travis builds the site and pushes it to a fork of arrow-site (where there is no restriction on deploy keys), and then it makes a PR from there back to the asf-site branch of apache/arrow-site using [hub|https://hub.github.com/hub-pull-request.1.html]. So it's "semiautomatic": the asf-site PR is made automatically, but a committer will need to merge it. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6496) [Python] Fix ARROW_ORC=ON build in Python wheels on macOS
Wes McKinney created ARROW-6496: --- Summary: [Python] Fix ARROW_ORC=ON build in Python wheels on macOS Key: ARROW-6496 URL: https://issues.apache.org/jira/browse/ARROW-6496 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney This was disabled in ARROW-6446 https://github.com/apache/arrow/pull/5291 as it was failing -- This message was sent by Atlassian Jira (v8.3.2#803003)
Re: Plasma scenarios
hi Eric, On Fri, Sep 6, 2019 at 5:09 PM Eric Erhardt wrote: > > I was looking for the high level scenarios for the Plasma In-Memory Object > Store. A colleague of mine suggested we could use it to pass data between a > C# process and a Python process. > > I've read the intro blog [0] on Plasma, which describes using the same data > set from multiple processes - which sounds like the same scenario as above. > > I am trying to prioritize creating C# bindings for the Plasma client. So I'd > like to know all the scenarios that would could be enabled with Plasma. > > For example: > - could using Plasma speed up Pandas UDFs in PySpark? Because the data > wouldn't have to go across the socket between Java and Python, but instead > would be memory-mapped. We have similar functionality in .NET for Apache > Spark. Memory still would need to be copied into the memory-mappable file, so it's unclear whether this would be faster than passing the data through a socket as it's being done now. > - Is Plasma being used by Nvidia RAPIDS? AFAIK it is not. It doesn't seem out of the question, though, given that we have some level of CUDA support in Plasma now. > > I know Plasma today is not supported on Windows, but I think support could be > added since Windows supports memory mapped files (through a different API > than mmap) and it now supports Unix Domain Sockets [1]. > > Also - side question about the c_glib bindings. I assume those will only ever > work on Windows with something like Cygwin or MSYS2, right? Would people be > opposed to adding pure "C" exports to the plasma library so the C# bindings > could use it? (similar to the JNI support today). 
> In theory you could use the GLib-based library with MSVC, the main requirement is gobject-introspection https://github.com/GNOME/gobject-introspection/blob/master/MSVC.README.rst Note that GLib itself is LGPL-licensed -- since it is an optional component in Apache Arrow, it is OK for optional components to have an LGPL dependency (though ASF projects aren't allowed to have mandatory/hard dependencies on LGPL). So if you do go that route just beware the possible issues you might have down the road. I have no objection to adding a "plasma/plasma-c.h" with C exports. > Eric > > [0] > https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html > [1] https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/
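Wes's caveat — the data still incurs one copy into the memory-mappable file, after which other processes can map the same bytes without further copies — can be illustrated with stdlib mmap. This shows the underlying mechanism only, not the Plasma client API:

```python
# One copy into a shared, memory-mapped file; consumers then map the same
# bytes. Plain mmap illustration, not pyarrow.plasma.
import mmap
import os
import tempfile

payload = b"\x01\x02\x03\x04" * 1024

path = os.path.join(tempfile.mkdtemp(), "shared.buf")
with open(path, "wb") as f:
    f.truncate(len(payload))

# Producer: the one unavoidable copy, into the shared mapping.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), len(payload)) as m:
        m[:] = payload

# Consumer (could be another process): maps the same file read-only.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        view = bytes(m[:4])
print(view)  # b'\x01\x02\x03\x04'
```

Whether this beats a socket therefore depends on how expensive that initial serialization/copy is relative to the socket transfer it replaces.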
Re: Can the R interface to write_parquet accept strings?
Hi Daniel, This works on my machine: > library(arrow) > write_parquet(data.frame(y = c("a", "b", "c"), stringsAsFactors=FALSE), file= > "string.parquet") > read_parquet("string.parquet") y 1 a 2 b 3 c > (The function masking warnings are all from library(tidyverse) and aren't relevant here.) What OS are you on, and how did you install the arrow package? I'm on macOS and installed arrow from CRAN, but if that's not the case for you, then your C++ library may have different capabilities. Neal On Sun, Sep 8, 2019 at 3:41 AM Daniel Feenberg wrote: > > Can the R interface to Arrow Parquet write string data? Take the > following script: > >library(arrow) >library(tidyverse) >write_parquet(table = tibble(y = c("a", "b", "c")), file = > "string.parquet") > > I get the error message: > >Error in write_parquet_file(to_arrow(table), file) : >Arrow error: IOError: Metadata contains Thrift LogicalType that is >not recognized. > > after warnings that stats::filter(), stats::lag() and > arrow::read_table() are masked, but I assume that isn't the problem. > This is with R 3.5.1 and arrow_0.14.1.1 > > > Daniel Feenberg
[jira] [Created] (ARROW-6495) [Plasma] Use xxh3 for object hashing
Antoine Pitrou created ARROW-6495: - Summary: [Plasma] Use xxh3 for object hashing Key: ARROW-6495 URL: https://issues.apache.org/jira/browse/ARROW-6495 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Reporter: Antoine Pitrou We recently vendored xxh3 in Arrow. Plasma may want to use it for object hashing, since it's supposed to be even faster than XXH64. See https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html for performance numbers. -- This message was sent by Atlassian Jira (v8.3.2#803003)
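xxh3 is not available in the Python stdlib, so as a hedged illustration of what "object hashing" means in this ticket — digesting an object's data and metadata in chunks with a pluggable hash function — the sketch below uses hashlib.blake2b as a stand-in; the actual change would swap XXH64 for XXH3 in the C++ Plasma store:

```python
# Chunked digest of an object's data + metadata with a pluggable hasher.
# blake2b stands in for xxh3, which has no stdlib binding; names invented.
import hashlib

CHUNK = 64 * 1024

def digest_object(data: bytes, metadata: bytes, hasher=hashlib.blake2b) -> str:
    h = hasher(digest_size=8)            # 64-bit digest, like XXH64/XXH3-64
    for i in range(0, len(data), CHUNK):  # hash the payload in chunks
        h.update(data[i:i + CHUNK])
    h.update(metadata)
    return h.hexdigest()

d1 = digest_object(b"a" * 200_000, b"meta")
d2 = digest_object(b"a" * 200_000, b"meta")
print(d1 == d2)  # True: same content, same digest
```

Since the digest only guards against accidental mutation of sealed objects, a fast non-cryptographic hash like xxh3 is a reasonable fit.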
[jira] [Created] (ARROW-6494) [C++] Implement basic PartitionScheme
Benjamin Kietzman created ARROW-6494: Summary: [C++] Implement basic PartitionScheme Key: ARROW-6494 URL: https://issues.apache.org/jira/browse/ARROW-6494 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Benjamin Kietzman Assignee: Benjamin Kietzman The PartitionScheme interface parses paths and yields the partition expressions which are encoded in those paths. For example, the Hive partition scheme would yield {{"a"_=2 and "b"_=3}} from "a=2/b=3/*.parquet". -- This message was sent by Atlassian Jira (v8.3.2#803003)
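The behavior the ticket describes — parsing a Hive-style path like "a=2/b=3/*.parquet" into partition expressions — can be sketched with a plain dict standing in for Arrow's expression objects (function name invented, not the C++ PartitionScheme interface):

```python
# Toy Hive partition-scheme parser: key=value path segments become
# partition key/value pairs; non-matching segments (the file name) are skipped.
def parse_hive_path(path: str) -> dict:
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(parse_hive_path("a=2/b=3/data.parquet"))  # {'a': '2', 'b': '3'}
```

In the real interface those pairs would become expressions such as {{"a"_=2 and "b"_=3}}, usable for partition pruning during dataset discovery.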
[jira] [Created] (ARROW-6492) [Python] file written with latest fastparquet cannot be read with latest pyarrow
Joris Van den Bossche created ARROW-6492: Summary: [Python] file written with latest fastparquet cannot be read with latest pyarrow Key: ARROW-6492 URL: https://issues.apache.org/jira/browse/ARROW-6492 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche

From report on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/28252

With the latest released versions of fastparquet (0.3.2) and pyarrow (0.14.1), writing a file with pandas using the fastparquet engine cannot be read with the pyarrow engine:

{code}
df = pd.DataFrame({'A': [1, 2, 3]})
df.to_parquet("test.parquet", engine="fastparquet", compression=None)
pd.read_parquet("test.parquet", engine="pyarrow")
{code}

gives the following error when reading:

{code}
> 1 pd.read_parquet("test.parquet", engine="pyarrow")

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    292
    293     impl = get_engine(engine)
--> 294     return impl.read(path, columns=columns, **kwargs)

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    123         kwargs["use_pandas_metadata"] = True
    124         result = self.api.parquet.read_table(
--> 125             path, columns=columns, **kwargs
    126         ).to_pandas()
    127         if should_close:

~/miniconda3/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
    642         column_indexes = pandas_metadata.get('column_indexes', [])
    643         index_descriptors = pandas_metadata['index_columns']
--> 644         table = _add_any_metadata(table, pandas_metadata)
    645         table, index = _reconstruct_index(table, index_descriptors,
    646                                           all_columns)

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
    965                     raw_name = 'None'
    966
--> 967                 idx = schema.get_field_index(raw_name)
    968                 if idx != -1:
    969                     if col_meta['pandas_type'] == 'datetimetz':

~/miniconda3/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.get_field_index()

~/miniconda3/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

TypeError: expected bytes, dict found
{code}

-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6491) [Java] fix master build failure caused by ErrorProne
Pindikura Ravindra created ARROW-6491: - Summary: [Java] fix master build failure caused by ErrorProne Key: ARROW-6491 URL: https://issues.apache.org/jira/browse/ARROW-6491 Project: Apache Arrow Issue Type: Task Components: Java Reporter: Pindikura Ravindra Assignee: Ji Liu -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6490) [Java] log error for leak in allocator close
Pindikura Ravindra created ARROW-6490: - Summary: [Java] log error for leak in allocator close Key: ARROW-6490 URL: https://issues.apache.org/jira/browse/ARROW-6490 Project: Apache Arrow Issue Type: Task Components: Java Reporter: Pindikura Ravindra Assignee: Pindikura Ravindra Currently, the allocator close throws an exception that includes some details in case of memory leaks. However, if there is a hierarchy of allocators and they are all closed at different times, it's hard to find the cause of the original leak. If we also log a message when the leak occurs, it will be easier to correlate these. -- This message was sent by Atlassian Jira (v8.3.2#803003)
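A toy model of the proposal (illustrative names, not Arrow's Java BufferAllocator API): in addition to throwing on close, each allocator logs at the moment a leak is detected, so leaks in a hierarchy whose members close at different times can still be correlated afterwards:

```python
# Toy allocator hierarchy: close() both logs and raises on leak, so the
# log carries the correlation information even if the exception is swallowed.
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("allocator")

class ToyAllocator:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.allocated = name, parent, 0

    def allocate(self, n):
        self.allocated += n

    def release(self, n):
        self.allocated -= n

    def close(self):
        if self.allocated != 0:
            # Log at leak time, before raising, to ease later correlation.
            log.error("leak in %s: %d bytes outstanding", self.name, self.allocated)
            raise RuntimeError(f"memory leak in {self.name}")

root = ToyAllocator("root")
child = ToyAllocator("child", parent=root)
child.allocate(128)
try:
    child.close()
except RuntimeError as e:
    print(e)  # memory leak in child
```

The log line is the addition the ticket asks for; the exception behavior stays as it is today.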
Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal Richardson
Congratulations! On Sat, 7 Sep 2019 at 20:54, Rok Mihevc wrote: > Congrats all! > > On Sat, Sep 7, 2019 at 5:02 AM Bryan Cutler wrote: > > > Congrats Ben, Kenta and Neal! > > > > On Fri, Sep 6, 2019, 12:15 PM Krisztián Szűcs > > > wrote: > > > > > Congratulations! > > > > > > On Fri, Sep 6, 2019 at 8:12 PM Ben Kietzman > > > wrote: > > > > > > > Thanks! > > > > > > > > On Fri, Sep 6, 2019 at 1:09 PM Micah Kornfield < > emkornfi...@gmail.com> > > > > wrote: > > > > > > > > > Congrats everyone! (apologies if I double sent this). > > > > > > > > > > On Fri, Sep 6, 2019 at 10:06 AM Neal Richardson < > > > > > neal.p.richard...@gmail.com> > > > > > wrote: > > > > > > > > > > > Thanks, y'all! > > > > > > > > > > > > On Fri, Sep 6, 2019 at 5:44 AM David Li > > > wrote: > > > > > > > > > > > > > > Congrats all! :) > > > > > > > > > > > > > > Best, > > > > > > > David > > > > > > > > > > > > > > On 9/6/19, Francois Saint-Jacques > > wrote: > > > > > > > > Congrats to everyone! > > > > > > > > > > > > > > > > François > > > > > > > > > > > > > > > > On Fri, Sep 6, 2019 at 4:34 AM Kenta Murata > > > wrote: > > > > > > > >> > > > > > > > >> Thank you very much everyone! > > > > > > > >> I'm very happy to join this community. > > > > > > > >> > > > > > > > >> 2019年9月6日(金) 12:39 Micah Kornfield : > > > > > > > >> > > > > > > > >> > > > > > > > > >> > Congrats everyone. > > > > > > > >> > > > > > > > > >> > On Thu, Sep 5, 2019 at 7:06 PM Ji Liu > > > > > > > > > > > > > > > > >> > wrote: > > > > > > > >> > > > > > > > > >> > > Congratulations! 
> > > > > > > >> > > > > > > > > > >> > > Thanks, > > > > > > > >> > > Ji Liu > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > -- > > > > > > > >> > > From:Fan Liya > > > > > > > >> > > Send Time:2019年9月6日(星期五) 09:28 > > > > > > > >> > > To:dev > > > > > > > >> > > Subject:Re: [ANNOUNCE] New committers: Ben Kietzman, > Kenta > > > > > Murata, > > > > > > > >> > > and > > > > > > > >> > > Neal Richardson > > > > > > > >> > > > > > > > > > >> > > Big congratulations to Ben, Kenta and Neal! > > > > > > > >> > > > > > > > > > >> > > Best, > > > > > > > >> > > Liya Fan > > > > > > > >> > > > > > > > > > >> > > On Fri, Sep 6, 2019 at 5:33 AM Wes McKinney < > > > > > wesmck...@gmail.com> > > > > > > > >> > > wrote: > > > > > > > >> > > > > > > > > > >> > > > hi all, > > > > > > > >> > > > > > > > > > > >> > > > on behalf of the Arrow PMC, I'm pleased to announce > that > > > > Ben, > > > > > > > >> > > > Kenta, > > > > > > > >> > > > and Neal have accepted invitations to become Arrow > > > > committers. > > > > > > > >> > > > Welcome > > > > > > > >> > > > and thank you for all your contributions! > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> -- > > > > > > > >> Kenta Murata > > > > > > > >> OpenPGP FP = 1D69 ADDE 081C 9CC2 2E54 98C1 CEFE 8AFB 6081 > > B062 > > > > > > > >> > > > > > > > >> 本を書きました!! > > > > > > > >> 『Ruby 逆引きレシピ』 http://www.amazon.co.jp/dp/4798119881/mrkn-22 > > > > > > > >> > > > > > > > >> E-mail: m...@mrkn.jp > > > > > > > >> twitter: http://twitter.com/mrkn/ > > > > > > > >> blog: http://d.hatena.ne.jp/mrkn/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > >