Re: [Discuss] Format additions to Arrow for sparse data and data integrity
Hi Jacques, > That's quite interesting. Can you share more about the use case. Sorry, I realized I missed answering this. We are still investigating, so the initial diagnosis might be off. The use-case is a data transfer application: reading data at rest, translating it to Arrow, and sending it out to clients. I look forward to hearing your thoughts on the rest of the proposal. Thanks, Micah On Sat, Jul 6, 2019 at 2:53 PM Jacques Nadeau wrote: > What is the driving force for transport compression? Are you seeing that >>> as a major bottleneck in particular circumstances? (I'm not disagreeing, >>> just want to clearly define the particular problem you're worried about.) >> >> >> I've been working on a 20% project where we appear to be IO bound for >> transporting record batches. Also, I believe Ji Liu (tianchen92) has been >> seeing some of the same bottlenecks with the query engine they are >> working on. Trading off some CPU here would allow us to lower the overall >> latency in the system. >> > > That's quite interesting. Can you share more about the use case. With the > exception of broadcast and round-robin type distribution patterns, we find > that there are typically more cycles focused on partitioning the sending > data such that IO bounding is less of a problem. In most of our operations, > almost all the largest workloads are done via partitioning, thus it isn't > typically a problem. (We also have clients with 10gbps and 100gbps network > interconnects...) Are you partitioning the data pre-send? > > > >> Random thought: what do you think of defining this at the transport level >>> rather than the record batch level? (e.g. in Arrow Flight). This is one way >>> to avoid extending the core record batch concept with something that isn't >>> related to processing (at least in your initial proposal) >> >> >> Per above, this seems like a reasonable approach to me if we want to hold >> off on buffer level compression. 
Another use-case for buffer/record-batch >> level compression would be the Feather file format, for only decompressing >> a subset of columns/rows. If this use-case isn't compelling, I'd be happy to >> hold off adding compression to sparse batches until we have benchmarks >> showing the trade-off between channel-level and buffer-level compression. >> > > I was proposing that type-specific buffer encodings be done at the Flight > level, not message level encodings. Just want to make sure the formats > don't leak into the core spec until we're ready. >
Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"
Hi Eric, Short answer: I think your understanding matches what I was proposing. Longer answer below. So, for example, we release library v1.0.0 in a few months and then library > v2.0.0 a few months after that. In v2.0.0, C++, Python, and Java didn't > make any breaking API changes from 1.0.0. But C# made 3 API breaking > changes. This would be acceptable? Yes. I think all language bindings are undergoing rapid enough iteration that we are making at least a few small breaking API changes on each release even though we try to avoid it. I think it will be worth having further discussions on the release process once at least a few languages get to a more stable point. Thanks, Micah On Tue, Jul 9, 2019 at 2:26 PM Eric Erhardt wrote: > Just to be sure I fully understand the proposal: > > For the Library Version, we are going to increment the MAJOR version on > every normal release, and increment the MINOR version if we need to release > a patch/bug fix type of release. > > Since SemVer allows for API breaking changes on MAJOR versions, this > basically means, each library (C++, Python, C#, Java, etc) _can_ introduce > API breaking changes on every normal release (like we have been with the > 0.x.0 releases). > > So, for example, we release library v1.0.0 in a few months and then > library v2.0.0 a few months after that. In v2.0.0, C++, Python, and Java > didn't make any breaking API changes from 1.0.0. But C# made 3 API breaking > changes. This would be acceptable? > > If my understanding above is correct, then I think this is a good plan. > Initially I was concerned that the C# library wouldn't be free to make API > breaking changes after making the version `1.0.0`. The C# library is still > pretty inadequate, and I have a feeling there are a few things that will > need to change about it in the future. But with the above plan, this > concern won't be a problem. 
> > Eric > > -Original Message- > From: Micah Kornfield > Sent: Monday, July 1, 2019 10:02 PM > To: Wes McKinney > Cc: dev@arrow.apache.org > Subject: Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0" > > Hi Wes, > Thanks for your response. In regards to the protocol negotiation your > description of feature reporting (snipped below) is along the lines of what > I was thinking. It might not be necessary for 1.0.0, but at some point > might become useful. > > > > Note that we don't really have a mechanism for clients and servers to > > report to each other what features they support, so this could help > > with that when for applications where it might matter. > > > Thanks, > Micah > > > On Mon, Jul 1, 2019 at 12:54 PM Wes McKinney wrote: > > > hi Micah, > > > > Sorry for the delay in feedback. I looked at the document and it seems > > like a reasonable perspective about forward- and > > backward-compatibility. > > > > It seems like the main thing you are proposing is to apply Semantic > > Versioning to Format and Library versions separately. That's an > > interesting idea, my thought had been to have a version number that is > > FORMAT_VERSION.LIBRARY_VERSION.PATCH_VERSION. But your proposal is > > more flexible in some ways, so let me clarify for others reading > > > > In what you are proposing, the next release would be: > > > > Format version: 1.0.0 > > Library version: 1.0.0 > > > > Suppose that 20 major versions down the road we stand at > > > > Format version: 1.5.0 > > Library version: 20.0.0 > > > > The minor version of the Format would indicate that there are > > additions, like new elements in the Type union, but otherwise backward > > and forward compatible. So the Minor version means "new things, but > > old clients will not be disrupted if those new things are not used". > > We've already been doing this since the V4 Format iteration but we > > have not had a way to signal that there may be new features. 
As a > > corollary to this, I wonder if we should create a dual version in the > > metadata > > > > PROTOCOL VERSION: (what is currently MetadataVersion, V2) FEATURE > > VERSION: not tracked at all > > > > So Minor version bumps in the format would trigger a bump in the > > FeatureVersion. Note that we don't really have a mechanism for clients > > and servers to report to each other what features they support, so > > this could help with that when for applications where it might matter. > > > > Should backward/forward compatibility be disrupted in the future, then > > a change to the major version would be required. So in year 2025, say, > > we might decide that we want to do: > > > > Format version: 2.0.0 > > Library version: 21.0.0 > > > > The Format version would live in the project's Documentation, so the > > Apache releases are only the library version. > > > > Regarding your open questions: > > > > 1. Should we clean up "warts" on the specification, like redundant > > information > > > > I don't think it's necessary. So if Metadata V5 is Format Version > > 1.0.0 (currently we are V4, but
[jira] [Created] (ARROW-5897) [Java] Remove duplicated logic in MapVector
Liya Fan created ARROW-5897: --- Summary: [Java] Remove duplicated logic in MapVector Key: ARROW-5897 URL: https://issues.apache.org/jira/browse/ARROW-5897 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan The current implementation of MapVector contains much logic duplicated from the super class. We remove the duplication by: 1. Making the default data vector name configurable 2. Extracting a method for creating the reader -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class
Thanks for the feedback. I just posted a PR that removes the class in the C++ and Python libraries; hopefully this will help with the discussion. That I was able to do it in less than a day should be good evidence that the abstraction may be superfluous: https://github.com/apache/arrow/pull/4841 On Tue, Jul 9, 2019 at 4:26 PM Tim Swast wrote: > > FWIW, I found the Column class to be confusing in Python. It felt redundant > / unneeded to actually create Tables. > > On Tue, Jul 9, 2019 at 11:19 AM Wes McKinney wrote: > > > On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou wrote: > > > > > > > > > Le 08/07/2019 à 23:17, Wes McKinney a écrit : > > > > > > > > I'm concerned about continuing to maintain the Column class as it's > > > > spilling complexity into computational libraries and bindings alike. > > > > > > > > The Python Column class for example mostly forwards method calls to > > > > the underlying ChunkedArray > > > > > > > > > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355 > > > > > > > > If the developer wants to construct a Table or insert a new "column", > > > > Column objects must generally be constructed, leading to boilerplate > > > > without clear benefit. > > > > > > We could simply add the desired ChunkedArray-based convenience methods > > > without removing the Column-based APIs. > > > > > > I don't know if it's really cumbersome to maintain the Column class. > > > It's generally a very stable part of the API, and the Column class is > > > just a thin wrapper over a ChunkedArray + a field. > > > > > > > The indirection that it produces in public APIs I have found to be a > > nuisance, though (for example, doing things with the result of > > table[i] in Python). > > > > I'm about halfway through a patch to remove it, I'll let people review > > the work to assess the before-and-after. > > > > > Regards > > > > > > Antoine. > >
Re: Spark and Arrow Flight
Hi Ryan, have you thought about developing this inside Apache Arrow? On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler wrote: > Great, thanks Ryan! I'll take a look > > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray wrote: > > > Hi Bryan, > > > > I have an implementation of option #3 nearly ready for a PR. I will > mention > > you when I publish it. > > > > The working prototype for the Spark connector is here: > > https://github.com/rymurr/flight-spark-source. It technically works (and > > is > > very fast!) however the implementation is pretty dodgy and needs to be > > cleaned up before ready for prime time. I plan to have it ready to go for > > the Arrow 1.0.0 release as an apache 2.0 licensed project. Please shout > if > > you have any comments or are interested in contributing! > > > > Best, > > Ryan > > > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler wrote: > > > > > I'm in favor of option #3 also, but not sure what the best thing to do > > with > > > the existing FlightInfo response is. I'm definitely interested in > > > connecting Spark with Flight, can you share more details of your work > or > > is > > > it planned to be open sourced? > > > > > > Thanks, > > > Bryan > > > > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou > > wrote: > > > > > > > > > > > Either #3 or #4 for me. If #3, the default GetSchema implementation > > can > > > > rely on calling GetFlightInfo. > > > > > > > > > > > > Le 01/07/2019 à 22:50, David Li a écrit : > > > > > I think I'd prefer #3 over overloading an existing call (#2). > > > > > > > > > > We've been thinking about a similar issue, where sometimes we want > > > > > just the schema, but the service can't necessarily return the > schema > > > > > without fetching data - right now we return a sentinel value in > > > > > GetFlightInfo, but a separate RPC would let us explicitly indicate > an > > > > > error. 
> > > > > > > > > > I might be missing something though - what happens between step 1 > and > > > > > 2 that makes the endpoints available? Would it make sense to use > > > > > DoAction to cause the backend to "prepare" the endpoints, and have > > the > > > > > result of that be an encoded schema? So then the flow would be > > > > > DoAction -> GetFlightInfo -> DoGet. > > > > > > > > > > Best, > > > > > David > > > > > > > > > > On 7/1/19, Wes McKinney wrote: > > > > >> My inclination is either #2 or #3. #4 is an option of course, but > I > > > > >> like the more structured solution of explicitly requesting the > > schema > > > > >> given a descriptor. > > > > >> > > > > >> In both cases, it's possible that schemas are sent twice, e.g. if > > you > > > > >> call GetSchema and then later call GetFlightInfo and so you > receive > > > > >> the schema again. The schema is optional, so if it became a > > > > >> performance problem then a particular server might return the > schema > > > > >> as null from GetFlightInfo. > > > > >> > > > > >> I think it's valid to want to make a single GetFlightInfo RPC > > request > > > > >> that returns _both_ the schema and the query plan. > > > > >> > > > > >> Thoughts from others? > > > > >> > > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau < > jacq...@apache.org> > > > > wrote: > > > > >>> > > > > >>> My initial inclination is towards #3 but I'd be curious what > others > > > > >>> think. > > > > >>> In the case of #3, I wonder if it makes sense to then pull the > > Schema > > > > off > > > > >>> the GetFlightInfo response... > > > > >>> > > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray > > > > wrote: > > > > >>> > > > > Hi All, > > > > > > > > I have been working on building an arrow flight source for > spark. > > > The > > > > goal > > > > here is for Spark to be able to use a group of arrow flight > > > endpoints > > > > to > > > > get a dataset pulled over to spark in parallel. 
> > > > > > > > I am unsure of the best model for the spark <-> flight > > conversation > > > > and > > > > wanted to get your opinion on the best way to go. > > > > > > > > I am breaking up the query to flight from spark into 3 parts: > > > > 1) get the schema using GetFlightInfo. This is needed to do > > further > > > > lazy > > > > operations in Spark > > > > 2) get the endpoints by calling GetFlightInfo a 2nd time with a > > > > different > > > > argument. This returns the list endpoints on the parallel flight > > > > server. > > > > The endpoints are not available till data is ready to be > fetched, > > > > which > > > > is > > > > done after the schema but is needed before DoGet is called. > > > > 3) call get stream on all endpoints from 2 > > > > > > > > I think I have to do each step however I don't like having to > call > > > > getInfo > > > > twice, it doesn't seem very elegant. I see a few options: > > > > 1) live with calling GetFlightInfo twice and with a custom bytes > > cmd > > > > to > >
[jira] [Created] (ARROW-5896) [C#] Array Builders should take an initial capacity in their constructors
Eric Erhardt created ARROW-5896: --- Summary: [C#] Array Builders should take an initial capacity in their constructors Key: ARROW-5896 URL: https://issues.apache.org/jira/browse/ARROW-5896 Project: Apache Arrow Issue Type: Improvement Components: C# Reporter: Eric Erhardt When using the Fluent Array Builder API, we should take in an initial capacity in the constructor, so we can avoid allocating unnecessary memory. Today, if you create a builder and then call .Reserve(length) on it, the initial byte[] that was created in the constructor is wasted. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Spark and Arrow Flight
Great, thanks Ryan! I'll take a look On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray wrote: > Hi Bryan, > > I have an implementation of option #3 nearly ready for a PR. I will mention > you when I publish it. > > The working prototype for the Spark connector is here: > https://github.com/rymurr/flight-spark-source. It technically works (and > is > very fast!) however the implementation is pretty dodgy and needs to be > cleaned up before ready for prime time. I plan to have it ready to go for > the Arrow 1.0.0 release as an apache 2.0 licensed project. Please shout if > you have any comments or are interested in contributing! > > Best, > Ryan > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler wrote: > > > I'm in favor of option #3 also, but not sure what the best thing to do > with > > the existing FlightInfo response is. I'm definitely interested in > > connecting Spark with Flight, can you share more details of your work or > is > > it planned to be open sourced? > > > > Thanks, > > Bryan > > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou > wrote: > > > > > > > > Either #3 or #4 for me. If #3, the default GetSchema implementation > can > > > rely on calling GetFlightInfo. > > > > > > > > > Le 01/07/2019 à 22:50, David Li a écrit : > > > > I think I'd prefer #3 over overloading an existing call (#2). > > > > > > > > We've been thinking about a similar issue, where sometimes we want > > > > just the schema, but the service can't necessarily return the schema > > > > without fetching data - right now we return a sentinel value in > > > > GetFlightInfo, but a separate RPC would let us explicitly indicate an > > > > error. > > > > > > > > I might be missing something though - what happens between step 1 and > > > > 2 that makes the endpoints available? Would it make sense to use > > > > DoAction to cause the backend to "prepare" the endpoints, and have > the > > > > result of that be an encoded schema? So then the flow would be > > > > DoAction -> GetFlightInfo -> DoGet. 
> > > > > > > > Best, > > > > David > > > > > > > > On 7/1/19, Wes McKinney wrote: > > > >> My inclination is either #2 or #3. #4 is an option of course, but I > > > >> like the more structured solution of explicitly requesting the > schema > > > >> given a descriptor. > > > >> > > > >> In both cases, it's possible that schemas are sent twice, e.g. if > you > > > >> call GetSchema and then later call GetFlightInfo and so you receive > > > >> the schema again. The schema is optional, so if it became a > > > >> performance problem then a particular server might return the schema > > > >> as null from GetFlightInfo. > > > >> > > > >> I think it's valid to want to make a single GetFlightInfo RPC > request > > > >> that returns _both_ the schema and the query plan. > > > >> > > > >> Thoughts from others? > > > >> > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau > > > wrote: > > > >>> > > > >>> My initial inclination is towards #3 but I'd be curious what others > > > >>> think. > > > >>> In the case of #3, I wonder if it makes sense to then pull the > Schema > > > off > > > >>> the GetFlightInfo response... > > > >>> > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray > > > wrote: > > > >>> > > > Hi All, > > > > > > I have been working on building an arrow flight source for spark. > > The > > > goal > > > here is for Spark to be able to use a group of arrow flight > > endpoints > > > to > > > get a dataset pulled over to spark in parallel. > > > > > > I am unsure of the best model for the spark <-> flight > conversation > > > and > > > wanted to get your opinion on the best way to go. > > > > > > I am breaking up the query to flight from spark into 3 parts: > > > 1) get the schema using GetFlightInfo. This is needed to do > further > > > lazy > > > operations in Spark > > > 2) get the endpoints by calling GetFlightInfo a 2nd time with a > > > different > > > argument. This returns the list endpoints on the parallel flight > > > server. 
> > > The endpoints are not available till data is ready to be fetched, > > > which > > > is > > > done after the schema but is needed before DoGet is called. > > > 3) call get stream on all endpoints from 2 > > > > > > I think I have to do each step however I don't like having to call > > > getInfo > > > twice, it doesn't seem very elegant. I see a few options: > > > 1) live with calling GetFlightInfo twice and with a custom bytes > cmd > > > to > > > differentiate the purpose of each call > > > 2) add an argument to GetFlightInfo to tell it its being called > only > > > for > > > the schema > > > 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to > > return > > > just > > > the Schema in question > > > 4) use DoAction and wrap the expected FlightInfo in a Result > > > > > > I am aware that 4 is probably
Re: Spark and Arrow Flight
Hi Bryan, I have an implementation of option #3 nearly ready for a PR. I will mention you when I publish it. The working prototype for the Spark connector is here: https://github.com/rymurr/flight-spark-source. It technically works (and is very fast!); however, the implementation is pretty dodgy and needs to be cleaned up before it's ready for prime time. I plan to have it ready to go for the Arrow 1.0.0 release as an Apache 2.0-licensed project. Please shout if you have any comments or are interested in contributing! Best, Ryan On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler wrote: > I'm in favor of option #3 also, but not sure what the best thing to do with > the existing FlightInfo response is. I'm definitely interested in > connecting Spark with Flight, can you share more details of your work or is > it planned to be open sourced? > > Thanks, > Bryan > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou wrote: > > > > > Either #3 or #4 for me. If #3, the default GetSchema implementation can > > rely on calling GetFlightInfo. > > > > > > Le 01/07/2019 à 22:50, David Li a écrit : > > > I think I'd prefer #3 over overloading an existing call (#2). > > > > > > We've been thinking about a similar issue, where sometimes we want > > > just the schema, but the service can't necessarily return the schema > > > without fetching data - right now we return a sentinel value in > > > GetFlightInfo, but a separate RPC would let us explicitly indicate an > > > error. > > > > > > I might be missing something though - what happens between step 1 and > > > 2 that makes the endpoints available? Would it make sense to use > > > DoAction to cause the backend to "prepare" the endpoints, and have the > > > result of that be an encoded schema? So then the flow would be > > > DoAction -> GetFlightInfo -> DoGet. > > > > > > Best, > > > David > > > > > > On 7/1/19, Wes McKinney wrote: > > >> My inclination is either #2 or #3. 
#4 is an option of course, but I > > >> like the more structured solution of explicitly requesting the schema > > >> given a descriptor. > > >> > > >> In both cases, it's possible that schemas are sent twice, e.g. if you > > >> call GetSchema and then later call GetFlightInfo and so you receive > > >> the schema again. The schema is optional, so if it became a > > >> performance problem then a particular server might return the schema > > >> as null from GetFlightInfo. > > >> > > >> I think it's valid to want to make a single GetFlightInfo RPC request > > >> that returns _both_ the schema and the query plan. > > >> > > >> Thoughts from others? > > >> > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau > > wrote: > > >>> > > >>> My initial inclination is towards #3 but I'd be curious what others > > >>> think. > > >>> In the case of #3, I wonder if it makes sense to then pull the Schema > > off > > >>> the GetFlightInfo response... > > >>> > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray > > wrote: > > >>> > > Hi All, > > > > I have been working on building an arrow flight source for spark. > The > > goal > > here is for Spark to be able to use a group of arrow flight > endpoints > > to > > get a dataset pulled over to spark in parallel. > > > > I am unsure of the best model for the spark <-> flight conversation > > and > > wanted to get your opinion on the best way to go. > > > > I am breaking up the query to flight from spark into 3 parts: > > 1) get the schema using GetFlightInfo. This is needed to do further > > lazy > > operations in Spark > > 2) get the endpoints by calling GetFlightInfo a 2nd time with a > > different > > argument. This returns the list endpoints on the parallel flight > > server. > > The endpoints are not available till data is ready to be fetched, > > which > > is > > done after the schema but is needed before DoGet is called. 
> > 3) call get stream on all endpoints from 2 > > > > I think I have to do each step however I don't like having to call > > getInfo > > twice, it doesn't seem very elegant. I see a few options: > > 1) live with calling GetFlightInfo twice and with a custom bytes cmd > > to > > differentiate the purpose of each call > > 2) add an argument to GetFlightInfo to tell it its being called only > > for > > the schema > > 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to > return > > just > > the Schema in question > > 4) use DoAction and wrap the expected FlightInfo in a Result > > > > I am aware that 4 is probably the least disruptive but I'm also not > a > > fan > > as (to me) it implies performing an action on the server side. > > Suggestions > > 2 & 3 are larger changes and I am reluctant to do that unless there > is > > a > > consensus here. None of them are great options and I am wondering > what > > everyone thinks the best
Re: Spark and Arrow Flight
I'm in favor of option #3 also, but not sure what the best thing to do with the existing FlightInfo response is. I'm definitely interested in connecting Spark with Flight, can you share more details of your work or is it planned to be open sourced? Thanks, Bryan On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou wrote: > > Either #3 or #4 for me. If #3, the default GetSchema implementation can > rely on calling GetFlightInfo. > > > Le 01/07/2019 à 22:50, David Li a écrit : > > I think I'd prefer #3 over overloading an existing call (#2). > > > > We've been thinking about a similar issue, where sometimes we want > > just the schema, but the service can't necessarily return the schema > > without fetching data - right now we return a sentinel value in > > GetFlightInfo, but a separate RPC would let us explicitly indicate an > > error. > > > > I might be missing something though - what happens between step 1 and > > 2 that makes the endpoints available? Would it make sense to use > > DoAction to cause the backend to "prepare" the endpoints, and have the > > result of that be an encoded schema? So then the flow would be > > DoAction -> GetFlightInfo -> DoGet. > > > > Best, > > David > > > > On 7/1/19, Wes McKinney wrote: > >> My inclination is either #2 or #3. #4 is an option of course, but I > >> like the more structured solution of explicitly requesting the schema > >> given a descriptor. > >> > >> In both cases, it's possible that schemas are sent twice, e.g. if you > >> call GetSchema and then later call GetFlightInfo and so you receive > >> the schema again. The schema is optional, so if it became a > >> performance problem then a particular server might return the schema > >> as null from GetFlightInfo. > >> > >> I think it's valid to want to make a single GetFlightInfo RPC request > >> that returns _both_ the schema and the query plan. > >> > >> Thoughts from others? 
> >> > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau > wrote: > >>> > >>> My initial inclination is towards #3 but I'd be curious what others > >>> think. > >>> In the case of #3, I wonder if it makes sense to then pull the Schema > off > >>> the GetFlightInfo response... > >>> > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray > wrote: > >>> > Hi All, > > I have been working on building an arrow flight source for spark. The > goal > here is for Spark to be able to use a group of arrow flight endpoints > to > get a dataset pulled over to spark in parallel. > > I am unsure of the best model for the spark <-> flight conversation > and > wanted to get your opinion on the best way to go. > > I am breaking up the query to flight from spark into 3 parts: > 1) get the schema using GetFlightInfo. This is needed to do further > lazy > operations in Spark > 2) get the endpoints by calling GetFlightInfo a 2nd time with a > different > argument. This returns the list endpoints on the parallel flight > server. > The endpoints are not available till data is ready to be fetched, > which > is > done after the schema but is needed before DoGet is called. > 3) call get stream on all endpoints from 2 > > I think I have to do each step however I don't like having to call > getInfo > twice, it doesn't seem very elegant. I see a few options: > 1) live with calling GetFlightInfo twice and with a custom bytes cmd > to > differentiate the purpose of each call > 2) add an argument to GetFlightInfo to tell it its being called only > for > the schema > 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to return > just > the Schema in question > 4) use DoAction and wrap the expected FlightInfo in a Result > > I am aware that 4 is probably the least disruptive but I'm also not a > fan > as (to me) it implies performing an action on the server side. > Suggestions > 2 & 3 are larger changes and I am reluctant to do that unless there is > a > consensus here. 
None of them are great options and I am wondering what > everyone thinks the best approach might be? Particularly as I think > this > is > likely to come up in more applications than just spark. > > Best, > Ryan > > >> >
Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class
FWIW, I found the Column class to be confusing in Python. It felt redundant / unneeded to actually create Tables. On Tue, Jul 9, 2019 at 11:19 AM Wes McKinney wrote: > On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou wrote: > > > > > > Le 08/07/2019 à 23:17, Wes McKinney a écrit : > > > > > > I'm concerned about continuing to maintain the Column class as it's > > > spilling complexity into computational libraries and bindings alike. > > > > > > The Python Column class for example mostly forwards method calls to > > > the underlying ChunkedArray > > > > > > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355 > > > > > > If the developer wants to construct a Table or insert a new "column", > > > Column objects must generally be constructed, leading to boilerplate > > > without clear benefit. > > > > We could simply add the desired ChunkedArray-based convenience methods > > without removing the Column-based APIs. > > > > I don't know if it's really cumbersome to maintain the Column class. > > It's generally a very stable part of the API, and the Column class is > > just a thin wrapper over a ChunkedArray + a field. > > > > The indirection that it produces in public APIs I have found to be a > nuisance, though (for example, doing things with the result of > table[i] in Python). > > I'm about halfway through a patch to remove it, I'll let people review > the work to assess the before-and-after. > > > Regards > > > > Antoine. >
RE: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"
Just to be sure I fully understand the proposal: For the Library Version, we are going to increment the MAJOR version on every normal release, and increment the MINOR version if we need to release a patch/bug fix type of release. Since SemVer allows for API breaking changes on MAJOR versions, this basically means, each library (C++, Python, C#, Java, etc) _can_ introduce API breaking changes on every normal release (like we have been with the 0.x.0 releases). So, for example, we release library v1.0.0 in a few months and then library v2.0.0 a few months after that. In v2.0.0, C++, Python, and Java didn't make any breaking API changes from 1.0.0. But C# made 3 API breaking changes. This would be acceptable? If my understanding above is correct, then I think this is a good plan. Initially I was concerned that the C# library wouldn't be free to make API breaking changes after making the version `1.0.0`. The C# library is still pretty inadequate, and I have a feeling there are a few things that will need to change about it in the future. But with the above plan, this concern won't be a problem. Eric -Original Message- From: Micah Kornfield Sent: Monday, July 1, 2019 10:02 PM To: Wes McKinney Cc: dev@arrow.apache.org Subject: Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0" Hi Wes, Thanks for your response. In regards to the protocol negotiation your description of feature reporting (snipped below) is along the lines of what I was thinking. It might not be necessary for 1.0.0, but at some point might become useful. > Note that we don't really have a mechanism for clients and servers to > report to each other what features they support, so this could help > with that when for applications where it might matter. Thanks, Micah On Mon, Jul 1, 2019 at 12:54 PM Wes McKinney wrote: > hi Micah, > > Sorry for the delay in feedback. I looked at the document and it seems > like a reasonable perspective about forward- and > backward-compatibility. 
> > It seems like the main thing you are proposing is to apply Semantic > Versioning to Format and Library versions separately. That's an > interesting idea, my thought had been to have a version number that is > FORMAT_VERSION.LIBRARY_VERSION.PATCH_VERSION. But your proposal is > more flexible in some ways, so let me clarify for others reading > > In what you are proposing, the next release would be: > > Format version: 1.0.0 > Library version: 1.0.0 > > Suppose that 20 major versions down the road we stand at > > Format version: 1.5.0 > Library version: 20.0.0 > > The minor version of the Format would indicate that there are > additions, like new elements in the Type union, but otherwise backward > and forward compatible. So the Minor version means "new things, but > old clients will not be disrupted if those new things are not used". > We've already been doing this since the V4 Format iteration but we > have not had a way to signal that there may be new features. As a > corollary to this, I wonder if we should create a dual version in the > metadata > > PROTOCOL VERSION: (what is currently MetadataVersion, V2) FEATURE > VERSION: not tracked at all > > So Minor version bumps in the format would trigger a bump in the > FeatureVersion. Note that we don't really have a mechanism for clients > and servers to report to each other what features they support, so > this could help with that for applications where it might matter. > > Should backward/forward compatibility be disrupted in the future, then > a change to the major version would be required. So in year 2025, say, > we might decide that we want to do: > > Format version: 2.0.0 > Library version: 21.0.0 > > The Format version would live in the project's Documentation, so the > Apache releases are only the library version. > > Regarding your open questions: > > 1. Should we clean up "warts" on the specification, like redundant > information > > I don't think it's necessary. 
So if Metadata V5 is Format Version > 1.0.0 (currently we are V4, but we're discussing some possible > non-forward compatible changes...) I think that's OK. None of these > things are "hurting" anything > > 2. Do we need additional mechanisms for marking some features as > experimental? > > Not sure, but I think this can be mostly addressed through > documentation. Flight will still be experimental in 1.0.0, for > example. > > 3. Do we need protocol negotiation mechanisms in Flight > > Could you explain what you mean? Are you thinking if there is some > major revamp of the protocol and you need to switch between a "V1 > Flight Protocol" and a "V2 Flight Protocol"? > > - Wes > > On Thu, Jun 13, 2019 at 2:17 AM Micah Kornfield > > wrote: > > > > Hi Everyone, > > I think there might be some ideas that we still need to reach > > consensus on > > how the format and libraries evolve in a post-1.0.0 release world. > > Specifically, I think we need to agree on
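The dual-version scheme discussed above (a slow-moving Format version alongside a fast-moving Library version) boils down to a simple compatibility rule. The sketch below is purely illustrative of the SemVer semantics being proposed, not an official Arrow mechanism:

```python
# Illustrative sketch only -- not an official Arrow API. Under the proposal,
# the Format version follows SemVer: readers and writers interoperate when
# the format MAJOR matches; MINOR bumps are purely additive features that
# old readers can safely ignore as long as the new features are unused.
def compatible(writer_format, reader_format):
    """Each argument is a (major, minor) tuple of the Format version."""
    return writer_format[0] == reader_format[0]

# Library versions (1.0.0, 2.0.0, ..., 20.0.0) move independently of this.
assert compatible((1, 5), (1, 0))      # additive minor bump: still compatible
assert not compatible((2, 0), (1, 9))  # major bump: breaking
```

The FEATURE VERSION idea would then let a client report which of those additive minor-version features it actually understands.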
Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems
Hi Eric -- of course! On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt wrote: > Can we propose getting changes other than Python or Parquet related into > this release? > > For example, I found a critical issue in the C# implementation that, if > possible, I'd like to get included in a patch release. > https://github.com/apache/arrow/pull/4836 > > Eric > > -Original Message- > From: Wes McKinney > Sent: Tuesday, July 9, 2019 7:59 AM > To: dev@arrow.apache.org > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package > problems, Parquet forward compatibility problems > > On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei wrote: > > > > Hi, > > > > > If the problems can be resolved quickly, I should think we could cut > > > an RC for 0.14.1 by the end of this week. The RC could either be cut > > > from a maintenance branch or out of master -- any thoughts about > > > this (cutting from master is definitely easier)? > > > > How about just releasing 0.15.0 from master? > > It'll be simpler than creating a patch release. > > > > I'd be fine with that, too. > > > > > Thanks, > > -- > > kou > > > > In > > "[DISCUSS] Need for 0.14.1 release due to Python package problems, > Parquet forward compatibility problems" on Mon, 8 Jul 2019 11:32:07 -0500, > > Wes McKinney wrote: > > > > > hi folks, > > > > > > Perhaps unsurprisingly due to the expansion of our Python packages, > > > a number of things are broken in 0.14.0 that we should fix sooner > > > than the next major release. I'll try to send a complete list to > > > this thread to give a status within a day or two. Other problems may > > > arise in the next 48 hours as more people install the package. > > > > > > If the problems can be resolved quickly, I should think we could cut > > > an RC for 0.14.1 by the end of this week. The RC could either be cut > > > from a maintenance branch or out of master -- any thoughts about > > > this (cutting from master is definitely easier)? 
> > > > > > Would someone (who is not Kou) be able to assist with creating the RC? > > > > > > Thanks, > > > Wes >
RE: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems
Can we propose getting changes other than Python or Parquet related into this release? For example, I found a critical issue in the C# implementation that, if possible, I'd like to get included in a patch release. https://github.com/apache/arrow/pull/4836 Eric -Original Message- From: Wes McKinney Sent: Tuesday, July 9, 2019 7:59 AM To: dev@arrow.apache.org Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei wrote: > > Hi, > > > If the problems can be resolved quickly, I should think we could cut > > an RC for 0.14.1 by the end of this week. The RC could either be cut > > from a maintenance branch or out of master -- any thoughts about > > this (cutting from master is definitely easier)? > > How about just releasing 0.15.0 from master? > It'll be simpler than creating a patch release. > I'd be fine with that, too. > > Thanks, > -- > kou > > In > "[DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet > forward compatibility problems" on Mon, 8 Jul 2019 11:32:07 -0500, > Wes McKinney wrote: > > > hi folks, > > > > Perhaps unsurprisingly due to the expansion of our Python packages, > > a number of things are broken in 0.14.0 that we should fix sooner > > than the next major release. I'll try to send a complete list to > > this thread to give a status within a day or two. Other problems may > > arise in the next 48 hours as more people install the package. > > > > If the problems can be resolved quickly, I should think we could cut > > an RC for 0.14.1 by the end of this week. The RC could either be cut > > from a maintenance branch or out of master -- any thoughts about > > this (cutting from master is definitely easier)? > > > > Would someone (who is not Kou) be able to assist with creating the RC? > > > > Thanks, > > Wes
[jira] [Created] (ARROW-5895) [Python] New version stores timestamps as epoch ms instead of ISO timestamp string
John Wilson created ARROW-5895: -- Summary: [Python] New version stores timestamps as epoch ms instead of ISO timestamp string Key: ARROW-5895 URL: https://issues.apache.org/jira/browse/ARROW-5895 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.14.0 Environment: Linux dev.office.whoop.com 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux Reporter: John Wilson Just upgraded from pyarrow 0.13 to 0.14. Columns of type TimestampType(timestamp[ns]) now get written as epoch ms values: 1561939200507 Where 0.13 wrote TimestampType(timestamp[ns]) as an ISO string: 2019-07-01T00:00:00.507Z This broke my implementation. How do I get pyarrow to write ISO strings again in 0.14? Here is my table write:

{code:python}
pyarrow.parquet.write_to_dataset(table=tbl, root_path=local_path,
                                 partition_cols=['env', 'dt'],
                                 coerce_timestamps='ms',
                                 allow_truncated_timestamps=True,
                                 version='2.0',
                                 compression='SNAPPY')
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5894) libgandiva.so.14 is exporting libstdc++ symbols
Zhuo Peng created ARROW-5894: Summary: libgandiva.so.14 is exporting libstdc++ symbols Key: ARROW-5894 URL: https://issues.apache.org/jira/browse/ARROW-5894 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Affects Versions: 0.14.0 Reporter: Zhuo Peng For example: $ nm libgandiva.so.14 | grep "once_proxy" 018c0a10 T __once_proxy many other symbols are also exported which I guess shouldn't be (e.g. LLVM symbols) There seems to be no linker script for libgandiva.so (there was, but was never used and got deleted? [https://github.com/apache/arrow/blob/9265fe35b67db93f5af0b47e92e039c637ad5b3e/cpp/src/gandiva/symbols-helpers.map]). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class
On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou wrote: > > > Le 08/07/2019 à 23:17, Wes McKinney a écrit : > > > > I'm concerned about continuing to maintain the Column class as it's > > spilling complexity into computational libraries and bindings alike. > > > > The Python Column class for example mostly forwards method calls to > > the underlying ChunkedArray > > > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355 > > > > If the developer wants to construct a Table or insert a new "column", > > Column objects must generally be constructed, leading to boilerplate > > without clear benefit. > > We could simply add the desired ChunkedArray-based convenience methods > without removing the Column-based APIs. > > I don't know if it's really cumbersome to maintain the Column class. > It's generally a very stable part of the API, and the Column class is > just a thin wrapper over a ChunkedArray + a field. > The indirection that it produces in public APIs I have found to be a nuisance, though (for example, doing things with the result of table[i] in Python). I'm about halfway through a patch to remove it, I'll let people review the work to assess the before-and-after. > Regards > > Antoine.
[jira] [Created] (ARROW-5893) [C++] Remove arrow::Column class from C++ library
Wes McKinney created ARROW-5893: --- Summary: [C++] Remove arrow::Column class from C++ library Key: ARROW-5893 URL: https://issues.apache.org/jira/browse/ARROW-5893 Project: Apache Arrow Issue Type: New Feature Components: C++, GLib, MATLAB, Python, R Reporter: Wes McKinney Fix For: 1.0.0 Opening JIRA per ongoing discussion on mailing list. This class unfortunately touches a lot of places, so I'm going to start by removing it from the C++ and Python libraries to assist with discussion about its fate. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class
Le 08/07/2019 à 23:17, Wes McKinney a écrit : > > I'm concerned about continuing to maintain the Column class as it's > spilling complexity into computational libraries and bindings alike. > > The Python Column class for example mostly forwards method calls to > the underlying ChunkedArray > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355 > > If the developer wants to construct a Table or insert a new "column", > Column objects must generally be constructed, leading to boilerplate > without clear benefit. We could simply add the desired ChunkedArray-based convenience methods without removing the Column-based APIs. I don't know if it's really cumbersome to maintain the Column class. It's generally a very stable part of the API, and the Column class is just a thin wrapper over a ChunkedArray + a field. Regards Antoine.
[jira] [Created] (ARROW-5892) [C++][Gandiva] Support function aliases
Prudhvi Porandla created ARROW-5892: --- Summary: [C++][Gandiva] Support function aliases Key: ARROW-5892 URL: https://issues.apache.org/jira/browse/ARROW-5892 Project: Apache Arrow Issue Type: New Feature Components: C++ - Gandiva Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla This allows linking of several external names to the same precompiled function. For example, 'mod', 'modulo' can be used to access the mod function -- This message was sent by Atlassian JIRA (v7.6.3#76005)
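The aliasing described above (several external names resolving to one precompiled function) can be illustrated with a minimal registry sketch; this is a conceptual Python illustration, not Gandiva's actual C++ registry:

```python
# Conceptual sketch: a function registry where multiple external names
# map to a single underlying implementation.
registry = {}

def register(impl, *aliases):
    """Register one implementation under several external names."""
    for name in aliases:
        registry[name] = impl

register(lambda a, b: a % b, "mod", "modulo")

assert registry["mod"](7, 3) == 1
assert registry["modulo"] is registry["mod"]  # same underlying function
```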
[jira] [Created] (ARROW-5891) [C++][Gandiva] Remove duplicates in function registries
Prudhvi Porandla created ARROW-5891: --- Summary: [C++][Gandiva] Remove duplicates in function registries Key: ARROW-5891 URL: https://issues.apache.org/jira/browse/ARROW-5891 Project: Apache Arrow Issue Type: Task Components: C++ - Gandiva Reporter: Prudhvi Porandla Each precompiled function should have at most one "NativeFunction" entry in the registry. Also add a UnitTest which checks if there are duplicates -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5890) [C++][Python] Support ExtensionType arrays in more kernels
Joris Van den Bossche created ARROW-5890: Summary: [C++][Python] Support ExtensionType arrays in more kernels Key: ARROW-5890 URL: https://issues.apache.org/jira/browse/ARROW-5890 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche From a quick test (through Python), it seems that {{slice}} and {{take}} work, but the following do not: - {{cast}}: it could rely on the casting rules for the storage type. Or do we want to require explicitly taking the storage array before casting? - {{dictionary_encode}} / {{unique}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5889) [Python][C++] Parquet backwards compat for timestamps without timezone broken
Florian Jetter created ARROW-5889: - Summary: [Python][C++] Parquet backwards compat for timestamps without timezone broken Key: ARROW-5889 URL: https://issues.apache.org/jira/browse/ARROW-5889 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.14.0 Reporter: Florian Jetter Attachments: 0.12.1.parquet, 0.13.0.parquet When reading a parquet file which has timestamp fields they are read as a timestamp with timezone UTC if the parquet file was written by pyarrow 0.13.0 and/or 0.12.1. Expected behavior would be that they are loaded as timestamps without any timezone information. The attached files contain one row for all basic types and a few nested types, the timestamp fields are called datetime64 and datetime64_tz see also [https://github.com/JDASoftwareGroup/kartothek/tree/master/reference-data/arrow-compat] [https://github.com/JDASoftwareGroup/kartothek/blob/c47e52116e2dc726a74d7d6b97922a0252722ed0/tests/serialization/test_arrow_compat.py#L31] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5888) [Python][C++] Parquet write metadata not roundtrip safe for timezone timestamps
Florian Jetter created ARROW-5888: - Summary: [Python][C++] Parquet write metadata not roundtrip safe for timezone timestamps Key: ARROW-5888 URL: https://issues.apache.org/jira/browse/ARROW-5888 Project: Apache Arrow Issue Type: Bug Reporter: Florian Jetter The timezone is not roundtrip safe for timezones other than UTC when storing to parquet. Expected behavior would be that the timezone is properly reconstructed

{code:python}
schema = pa.schema(
    [
        pa.field("no_tz", pa.timestamp('us')),
        pa.field("utc", pa.timestamp('us', tz="UTC")),
        pa.field("europe", pa.timestamp('us', tz="Europe/Berlin")),
    ]
)
buf = pa.BufferOutputStream()
pq.write_metadata(schema, buf, coerce_timestamps="us")
pq_bytes = buf.getvalue().to_pybytes()
reader = pa.BufferReader(pq_bytes)
parquet_file = pq.ParquetFile(reader)
parquet_file.schema.to_arrow_schema()
# Output:
# no_tz: timestamp[us]
# utc: timestamp[us, tz=UTC]
# europe: timestamp[us, tz=UTC]
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class
I'm also +1 on removing this class. François On Tue, Jul 9, 2019 at 10:57 AM Uwe L. Korn wrote: > > This sounds fine to me, thus I'm +1 on removing this class. > > On Tue, Jul 9, 2019, at 2:11 PM, Wes McKinney wrote: > > Yes, the schema would be the point of truth for the Field. The ChunkedArray > > type would have to be validated against the schema types as with RecordBatch > > > > On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn wrote: > > > > > Hello Wes, > > > > > > where do you intend the Field object living then? Would this be part of > > > the schema of the Table object? > > > > > > Uwe > > > > > > On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote: > > > > hi folks, > > > > > > > > For some time now I have been uncertain about the utility provided by > > > > the arrow::Column C++ class. Fundamentally, it is a container for two > > > > things: > > > > > > > > * An arrow::Field object (name and data type) > > > > * An arrow::ChunkedArray object for the data > > > > > > > > It was added to the C++ library in ARROW-23 in March 2016 as the basis > > > > for the arrow::Table class which represents a collection of > > > > ChunkedArray objects coming usually from multiple RecordBatches. > > > > Sometimes a Table will have mostly columns with a single chunk while > > > > some columns will have many chunks. > > > > > > > > I'm concerned about continuing to maintain the Column class as it's > > > > spilling complexity into computational libraries and bindings alike. > > > > > > > > The Python Column class for example mostly forwards method calls to > > > > the underlying ChunkedArray > > > > > > > > > > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355 > > > > > > > > If the developer wants to construct a Table or insert a new "column", > > > > Column objects must generally be constructed, leading to boilerplate > > > > without clear benefit. 
> > > > > > > > Since we're discussing building a more significant higher-level > > > > DataFrame interface per past mailing list discussions, my preference > > > > would be to consider removing the Column class to make the user- and > > > > developer-facing data structures simpler. I hate to propose breaking > > > > API changes, so it may not be practical at this point, but I wanted to > > > > at least bring up the issue to see if others have opinions after > > > > working with the library for a few years. > > > > > > > > Thanks > > > > Wes > > > > > > > > >
Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class
I'll try to spend a little time soon refactoring to see how disruptive the change would be, and also to help persuade others about the benefits. On Tue, Jul 9, 2019 at 9:57 AM Uwe L. Korn wrote: > > This sounds fine to me, thus I'm +1 on removing this class. > > On Tue, Jul 9, 2019, at 2:11 PM, Wes McKinney wrote: > > Yes, the schema would be the point of truth for the Field. The ChunkedArray > > type would have to be validated against the schema types as with RecordBatch > > > > On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn wrote: > > > > > Hello Wes, > > > > > > where do you intend the Field object living then? Would this be part of > > > the schema of the Table object? > > > > > > Uwe > > > > > > On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote: > > > > hi folks, > > > > > > > > For some time now I have been uncertain about the utility provided by > > > > the arrow::Column C++ class. Fundamentally, it is a container for two > > > > things: > > > > > > > > * An arrow::Field object (name and data type) > > > > * An arrow::ChunkedArray object for the data > > > > > > > > It was added to the C++ library in ARROW-23 in March 2016 as the basis > > > > for the arrow::Table class which represents a collection of > > > > ChunkedArray objects coming usually from multiple RecordBatches. > > > > Sometimes a Table will have mostly columns with a single chunk while > > > > some columns will have many chunks. > > > > > > > > I'm concerned about continuing to maintain the Column class as it's > > > > spilling complexity into computational libraries and bindings alike. > > > > > > > > The Python Column class for example mostly forwards method calls to > > > > the underlying ChunkedArray > > > > > > > > > > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355 > > > > > > > > If the developer wants to construct a Table or insert a new "column", > > > > Column objects must generally be constructed, leading to boilerplate > > > > without clear benefit. 
> > > > > > > > Since we're discussing building a more significant higher-level > > > > DataFrame interface per past mailing list discussions, my preference > > > > would be to consider removing the Column class to make the user- and > > > > developer-facing data structures simpler. I hate to propose breaking > > > > API changes, so it may not be practical at this point, but I wanted to > > > > at least bring up the issue to see if others have opinions after > > > > working with the library for a few years. > > > > > > > > Thanks > > > > Wes > > > > > > > > >
Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class
This sounds fine to me, thus I'm +1 on removing this class. On Tue, Jul 9, 2019, at 2:11 PM, Wes McKinney wrote: > Yes, the schema would be the point of truth for the Field. The ChunkedArray > type would have to be validated against the schema types as with RecordBatch > > On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn wrote: > > > Hello Wes, > > > > where do you intend the Field object living then? Would this be part of > > the schema of the Table object? > > > > Uwe > > > > On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote: > > > hi folks, > > > > > > For some time now I have been uncertain about the utility provided by > > > the arrow::Column C++ class. Fundamentally, it is a container for two > > > things: > > > > > > * An arrow::Field object (name and data type) > > > * An arrow::ChunkedArray object for the data > > > > > > It was added to the C++ library in ARROW-23 in March 2016 as the basis > > > for the arrow::Table class which represents a collection of > > > ChunkedArray objects coming usually from multiple RecordBatches. > > > Sometimes a Table will have mostly columns with a single chunk while > > > some columns will have many chunks. > > > > > > I'm concerned about continuing to maintain the Column class as it's > > > spilling complexity into computational libraries and bindings alike. > > > > > > The Python Column class for example mostly forwards method calls to > > > the underlying ChunkedArray > > > > > > > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355 > > > > > > If the developer wants to construct a Table or insert a new "column", > > > Column objects must generally be constructed, leading to boilerplate > > > without clear benefit. > > > > > > Since we're discussing building a more significant higher-level > > > DataFrame interface per past mailing list discussions, my preference > > > would be to consider removing the Column class to make the user- and > > > developer-facing data structures simpler. 
I hate to propose breaking > > > API changes, so it may not be practical at this point, but I wanted to > > > at least bring up the issue to see if others have opinions after > > > working with the library for a few years. > > > > > > Thanks > > > Wes > > > > > >
[jira] [Created] (ARROW-5887) [C#] ArrowStreamWriter writes FieldNodes in wrong order
Eric Erhardt created ARROW-5887: --- Summary: [C#] ArrowStreamWriter writes FieldNodes in wrong order Key: ARROW-5887 URL: https://issues.apache.org/jira/browse/ARROW-5887 Project: Apache Arrow Issue Type: Bug Components: C# Reporter: Eric Erhardt Assignee: Eric Erhardt When ArrowStreamWriter is writing a {{RecordBatch}} with {{null}}s in it, it is mixing up the column's {{NullCount}}. You can see here: [https://github.com/apache/arrow/blob/90affbd2c41e80aa8c3fac1e4dbff60aafb415d3/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs#L195-L200] It is writing the fields from {{0}} -> {{fieldCount}} order. But then [lower|https://github.com/apache/arrow/blob/90affbd2c41e80aa8c3fac1e4dbff60aafb415d3/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs#L216-L220], it is writing the fields from {{fieldCount}} -> {{0}}. Looking at the [Java implementation|https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a0349675e488caf11/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/FBSerializables.java#L36-L44] it says {quote}// struct vectors have to be created in reverse order {quote} A simple test of roundtripping the following RecordBatch shows the issue:

{code:java}
var result = new RecordBatch(
    new Schema.Builder()
        .Field(f => f.Name("age").DataType(Int32Type.Default))
        .Field(f => f.Name("CharCount").DataType(Int32Type.Default))
        .Build(),
    new IArrowArray[]
    {
        new Int32Array(
            new ArrowBuffer.Builder<int>().Append(0).Build(),
            new ArrowBuffer.Builder<int>().Append(0).Build(),
            length: 1, nullCount: 1, offset: 0),
        new Int32Array(
            new ArrowBuffer.Builder<int>().Append(7).Build(),
            ArrowBuffer.Empty,
            length: 1, nullCount: 0, offset: 0)
    },
    length: 1);
{code}

Here, the "age" column should have a `null` in it. However, when you write and read this RecordBatch back, you see that the "CharCount" column has `NullCount` == 1 and "age" column has `NullCount` == 0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5886) [Python][Packaging] Manylinux1/2010 compliance issue with libz
Krisztian Szucs created ARROW-5886: -- Summary: [Python][Packaging] Manylinux1/2010 compliance issue with libz Key: ARROW-5886 URL: https://issues.apache.org/jira/browse/ARROW-5886 Project: Apache Arrow Issue Type: Bug Components: Packaging, Python Affects Versions: 0.14.0 Reporter: Krisztian Szucs So we statically link liblz4 in the manylinux1 wheels {code} # ldd pyarrow-manylinux1/libarrow.so.14 | grep z libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7fc28cef4000) {code} but dynamically in the manylinux2010 wheels {code} # ldd pyarrow-manylinux2010/libarrow.so.14 | grep z liblz4.so.1 => not found (already deleted to reproduce the issue) libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7f56f744) {code} this is what this PR resolves. What I find strange is that auditwheel seems to bundle libz for manylinux1: {code} # ls -lah pyarrow-manylinux1/*z*so.* -rwxr-xr-x 1 root root 115K Jun 29 00:14 pyarrow-manylinux1/libz-7f57503f.so.1.2.11 {code} while ldd still uses the system libz: {code} # ldd pyarrow-manylinux1/libarrow.so.14 | grep z libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7f91fcf3f000) {code} For manylinux2010 we also have liblz4: {code} # ls -lah pyarrow-manylinux2010/*z*so.* -rwxr-xr-x 1 root root 191K Jun 28 23:38 pyarrow-manylinux2010/liblz4-8cb8bdde.so.1.8.3 -rwxr-xr-x 1 root root 115K Jun 28 23:38 pyarrow-manylinux2010/libz-c69b9943.so.1.2.11 {code} and ldd similarly tries to load the system libs: {code} # ldd pyarrow-manylinux2010/libarrow.so.14 | grep z liblz4.so.1 => not found libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7fd72764e000) {code} Inspecting manylinux1 with `LD_DEBUG=files,libs ldd libarrow.so.14` it seems to search the right path, but cannot find the hashed version of libz `libz-7f57503f.so.1.2.11` {code} 463: file=libz.so.1 [0]; needed by ./libarrow.so.14 [0] 463: find library=libz.so.1 [0]; searching 463: search path=/tmp/pyarrow-manylinux1/. 
(RPATH from file ./libarrow.so.14) 463: trying file=/tmp/pyarrow-manylinux1/./libz.so.1 463: search cache=/etc/ld.so.cache 463: trying file=/lib/x86_64-linux-gnu/libz.so.1 {code} There is no `libz.so.1` just `libz-7f57503f.so.1.2.11`. Similarly for manylinux2010 and libz: {code} 470: file=libz.so.1 [0]; needed by ./libarrow.so.14 [0] 470: find library=libz.so.1 [0]; searching 470: search path=/tmp/pyarrow-manylinux2010/. (RPATH from file ./libarrow.so.14) 470: trying file=/tmp/pyarrow-manylinux2010/./libz.so.1 470: search cache=/etc/ld.so.cache 470: trying file=/lib/x86_64-linux-gnu/libz.so.1 {code} for liblz4 (again, I've deleted the system one): {code} 470: file=liblz4.so.1 [0]; needed by ./libarrow.so.14 [0] 470: find library=liblz4.so.1 [0]; searching 470: search path=/tmp/pyarrow-manylinux2010/. (RPATH from file ./libarrow.so.14) 470: trying file=/tmp/pyarrow-manylinux2010/./liblz4.so.1 470: search cache=/etc/ld.so.cache 470: search path=/lib/x86_64-linux-gnu/tls/x86_64:/lib/x86_64-linux-gnu/tls:/lib/x86_64-linux-gnu/x86_64:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/tls/x86_64:/usr/lib/x86_64-linux-gnu/tls:/usr/lib/x86_64-linux-gnu/x86_6$ :/usr/lib/x86_64-linux-gnu:/lib/tls/x86_64:/lib/tls:/lib/x86_64:/lib:/usr/lib/tls/x86_64:/usr/lib/tls:/usr/lib/x86_64:/usr/lib (system search path) {code} There are no `libz.so.1` nor `liblz4.so.1`, just `libz-c69b9943.so.1.2.11` and `liblz4-8cb8bdde.so.1.8.3` According to https://www.python.org/dev/peps/pep-0571/ `liblz4` nor `libz` are part of the whitelist, and while these are bundled with the wheel, seemingly cannot be found - perhaps because of the hash in the library name? I've tried to inspect the wheels with `auditwheel show` with version `2` and `1.10`, both says the following: {code} # auditwheel show pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl is consistent with the following platform tag: "linux_x86_64". 
The wheel references external versioned symbols in these system- provided shared libraries: libgcc_s.so.1 with versions {'GCC_3.3', 'GCC_3.4', 'GCC_3.0'}, libpthread.so.0 with versions {'GLIBC_2.3.3', 'GLIBC_2.12', 'GLIBC_2.2.5', 'GLIBC_2.3.2'}, libc.so.6 with versions {'GLIBC_2.4', 'GLIBC_2.6', 'GLIBC_2.2.5', 'GLIBC_2.7', 'GLIBC_2.3.4', 'GLIBC_2.3.2', 'GLIBC_2.3'}, libstdc++.so.6 with versions {'CXXABI_1.3', 'GLIBCXX_3.4.10', 'GLIBCXX_3.4.9', 'GLIBCXX_3.4.11', 'GLIBCXX_3.4.5', 'GLIBCXX_3.4', 'CXXABI_1.3.2', 'CXXABI_1.3.3'}, librt.so.1 with versions {'GLIBC_2.2.5'}, libm.so.6 with versions {'GLIBC_2.2.5'}, libdl.so.2 with versions
[jira] [Created] (ARROW-5885) Support optional arrow components via extras_require
George Sakkis created ARROW-5885: Summary: Support optional arrow components via extras_require Key: ARROW-5885 URL: https://issues.apache.org/jira/browse/ARROW-5885 Project: Apache Arrow Issue Type: Wish Components: Python Reporter: George Sakkis Since Arrow (and pyarrow) have several independent optional components, instead of installing all of them it would be convenient if these could be opted in from pip like {{pip install pyarrow[gandiva,flight,plasma]}} or opted out like {{pip install pyarrow[no-gandiva,no-flight,no-plasma]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
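For context on the request syntax: setuptools `extras_require` only adds optional *dependencies* per extra, so making compiled components optional would need build-system support beyond it. The sketch below (extra names taken from the ticket, parsing helper hypothetical) just illustrates how the bracketed `pyarrow[gandiva,flight]` request is structured:

```python
# Sketch: parsing the bracketed extras from a pip requirement string.
def requested_extras(spec):
    """Return the list of extras named in a `pkg[a,b]`-style requirement."""
    if "[" not in spec:
        return []
    inner = spec[spec.index("[") + 1 : spec.index("]")]
    return [e.strip() for e in inner.split(",") if e.strip()]

assert requested_extras("pyarrow[gandiva,flight,plasma]") == [
    "gandiva", "flight", "plasma"]
assert requested_extras("pyarrow") == []
```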
Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems
On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei wrote: > > Hi, > > > If the problems can be resolved quickly, I should think we could cut > > an RC for 0.14.1 by the end of this week. The RC could either be cut > > from a maintenance branch or out of master -- any thoughts about this > > (cutting from master is definitely easier)? > > How about just releasing 0.15.0 from master? > It'll be simpler than creating a patch release. > I'd be fine with that, too. > > Thanks, > -- > kou > > In > "[DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet > forward compatibility problems" on Mon, 8 Jul 2019 11:32:07 -0500, > Wes McKinney wrote: > > > hi folks, > > > > Perhaps unsurprisingly due to the expansion of our Python packages, a > > number of things are broken in 0.14.0 that we should fix sooner than > > the next major release. I'll try to send a complete list to this > > thread to give a status within a day or two. Other problems may arise > > in the next 48 hours as more people install the package. > > > > If the problems can be resolved quickly, I should think we could cut > > an RC for 0.14.1 by the end of this week. The RC could either be cut > > from a maintenance branch or out of master -- any thoughts about this > > (cutting from master is definitely easier)? > > > > Would someone (who is not Kou) be able to assist with creating the RC? > > > > Thanks, > > Wes
Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class
Yes, the schema would be the point of truth for the Field. The ChunkedArray type would have to be validated against the schema types, as with RecordBatch.

On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn wrote:
> Hello Wes,
>
> where do you intend the Field object living then? Would this be part of
> the schema of the Table object?
>
> Uwe
>
> On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> > hi folks,
> >
> > For some time now I have been uncertain about the utility provided by
> > the arrow::Column C++ class. Fundamentally, it is a container for two
> > things:
> >
> > * An arrow::Field object (name and data type)
> > * An arrow::ChunkedArray object for the data
> >
> > It was added to the C++ library in ARROW-23 in March 2016 as the basis
> > for the arrow::Table class which represents a collection of
> > ChunkedArray objects coming usually from multiple RecordBatches.
> > Sometimes a Table will have mostly columns with a single chunk while
> > some columns will have many chunks.
> >
> > I'm concerned about continuing to maintain the Column class as it's
> > spilling complexity into computational libraries and bindings alike.
> >
> > The Python Column class for example mostly forwards method calls to
> > the underlying ChunkedArray:
> >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> >
> > If the developer wants to construct a Table or insert a new "column",
> > Column objects must generally be constructed, leading to boilerplate
> > without clear benefit.
> >
> > Since we're discussing building a more significant higher-level
> > DataFrame interface per past mailing list discussions, my preference
> > would be to consider removing the Column class to make the user- and
> > developer-facing data structures simpler. I hate to propose breaking
> > API changes, so it may not be practical at this point, but I wanted to
> > at least bring up the issue to see if others have opinions after
> > working with the library for a few years.
> >
> > Thanks
> > Wes
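The "container for two things" relationship under discussion can be modeled in a few lines. This is a pure-Python sketch for illustration only; the real classes are C++ (arrow::Field, arrow::ChunkedArray, arrow::Column), and the method names here are simplified stand-ins.

```python
# Illustrative model of the Field/ChunkedArray/Column relationship;
# not the actual Arrow C++ or pyarrow API.
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    type: str

@dataclass
class ChunkedArray:
    chunks: list  # list of contiguous value chunks

    def length(self):
        return sum(len(c) for c in self.chunks)

@dataclass
class Column:
    field: Field
    data: ChunkedArray

    # The wrapper mostly forwards to the underlying ChunkedArray...
    def length(self):
        return self.data.length()

# ...which is why a Table could equally hold (schema, [ChunkedArray, ...])
# directly, with each Field living only in the schema -- the simplification
# proposed in this thread.
col = Column(Field("f0", "int64"), ChunkedArray([[1, 2], [3]]))
```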
[jira] [Created] (ARROW-5884) [Java] Fix the get method of StructVector
Liya Fan created ARROW-5884:
---
Summary: [Java] Fix the get method of StructVector
Key: ARROW-5884
URL: https://issues.apache.org/jira/browse/ARROW-5884
Project: Apache Arrow
Issue Type: Bug
Components: Java
Reporter: Liya Fan
Assignee: Liya Fan

When the data at the specified location is null, there is no need to call the method from super to set the reader:

    holder.isSet = isSet(index);
    super.get(index, holder);

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
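The logic of the proposed fix, stripped of Java specifics, is: set the holder's null flag first, and only delegate to the superclass (which populates the reader) when the value is actually set. A minimal Python sketch of that control flow, with Holder, the validity list, and the readers list as stand-ins for the Java holder class and vector internals:

```python
# Sketch of the fix's control flow; Holder and the validity/readers
# arguments are stand-ins for the Java ComplexHolder and vector state.
class Holder:
    def __init__(self):
        self.is_set = 0
        self.reader = None

def get(index, validity, readers, holder):
    holder.is_set = 1 if validity[index] else 0
    if holder.is_set == 0:
        # Value is null: skip the superclass call that would set the reader.
        return
    holder.reader = readers[index]  # what super.get(index, holder) would do
```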
[jira] [Created] (ARROW-5883) [Java] Support Dictionary Encoding for List type
Ji Liu created ARROW-5883:
-
Summary: [Java] Support Dictionary Encoding for List type
Key: ARROW-5883
URL: https://issues.apache.org/jira/browse/ARROW-5883
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Ji Liu
Assignee: Ji Liu

As described in [http://arrow.apache.org/docs/format/Layout.html#dictionary-encoding], encoding of the List type should be supported. ListVector#getObject now returns an ArrayList implementation whose equals and hashCode are already overridden, so it can be used directly as a HashMap key in DictionaryEncoder. Since we won't change the dictionary data, using a mutable key doesn't seem to matter.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
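The core observation is that once list values are usable as hash-map keys, the existing dictionary-building loop works unchanged. A sketch of that build step in Python, using tuples as the hashable key where the Java code would rely on ArrayList's equals/hashCode (the function name and shapes here are illustrative, not DictionaryEncoder's actual API):

```python
# Illustrative dictionary-encoding build step for list values;
# not the actual Java DictionaryEncoder API.
def dictionary_encode(lists):
    dictionary = []   # distinct list values, in first-seen order
    index_of = {}     # hashable key -> dictionary index
    indices = []      # encoded output: one index per input value
    for value in lists:
        key = tuple(value)  # tuples are hashable; lists are not
        if key not in index_of:
            index_of[key] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[key])
    return dictionary, indices
```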
[jira] [Created] (ARROW-5882) [C++][Gandiva] Throw error if divisor is 0 in integer mod functions
Prudhvi Porandla created ARROW-5882:
---
Summary: [C++][Gandiva] Throw error if divisor is 0 in integer mod functions
Key: ARROW-5882
URL: https://issues.apache.org/jira/browse/ARROW-5882
Project: Apache Arrow
Issue Type: Bug
Reporter: Prudhvi Porandla

mod_int64_int32 and mod_int64_int64 should throw an error when the divisor is 0.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
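The requested behavior is a guard before the modulo: raise a well-defined error instead of hitting undefined behavior on a zero divisor. A hedged sketch of the shape of the guard (Python's % semantics differ from C's truncated division for negative operands; the zero-divisor check, not the sign convention, is the point here):

```python
# Sketch of the zero-divisor guard requested for the mod kernels;
# the real functions are C++ pre-compiled Gandiva kernels.
def mod_int64(dividend, divisor):
    if divisor == 0:
        # Gandiva would surface this as an execution error to the caller.
        raise ZeroDivisionError("mod by zero")
    return dividend % divisor
```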
Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class
Hello Wes,

where do you intend the Field object living then? Would this be part of the schema of the Table object?

Uwe

On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> hi folks,
>
> For some time now I have been uncertain about the utility provided by
> the arrow::Column C++ class. Fundamentally, it is a container for two
> things:
>
> * An arrow::Field object (name and data type)
> * An arrow::ChunkedArray object for the data
>
> It was added to the C++ library in ARROW-23 in March 2016 as the basis
> for the arrow::Table class which represents a collection of
> ChunkedArray objects coming usually from multiple RecordBatches.
> Sometimes a Table will have mostly columns with a single chunk while
> some columns will have many chunks.
>
> I'm concerned about continuing to maintain the Column class as it's
> spilling complexity into computational libraries and bindings alike.
>
> The Python Column class for example mostly forwards method calls to
> the underlying ChunkedArray:
>
> https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
>
> If the developer wants to construct a Table or insert a new "column",
> Column objects must generally be constructed, leading to boilerplate
> without clear benefit.
>
> Since we're discussing building a more significant higher-level
> DataFrame interface per past mailing list discussions, my preference
> would be to consider removing the Column class to make the user- and
> developer-facing data structures simpler. I hate to propose breaking
> API changes, so it may not be practical at this point, but I wanted to
> at least bring up the issue to see if others have opinions after
> working with the library for a few years.
>
> Thanks
> Wes
[jira] [Created] (ARROW-5881) [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits
Liya Fan created ARROW-5881:
---
Summary: [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits
Key: ARROW-5881
URL: https://issues.apache.org/jira/browse/ARROW-5881
Project: Apache Arrow
Issue Type: New Feature
Components: Java
Reporter: Liya Fan
Assignee: Liya Fan

These utilities can be used to efficiently determine, for example:
* if all values in a vector are null
* if a vector contains no nulls
* if a vector contains any valid element
* if a vector contains any invalid element

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
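The usual approach for such utilities is to compare whole bytes of the validity bitmap first and then mask the trailing partial byte. A Python sketch of that idea (function names are illustrative; Arrow validity bitmaps are LSB-ordered, so the partial-byte mask covers the low bits):

```python
# Sketch of the proposed utilities: test whether the first value_count
# bits of a validity buffer are all 1 or all 0. Function names are
# illustrative, not the Java API.
def all_bits_set(buf: bytes, value_count: int) -> bool:
    full, rem = divmod(value_count, 8)
    if any(b != 0xFF for b in buf[:full]):   # fast path: whole bytes
        return False
    if rem:
        mask = (1 << rem) - 1                # low bits first (LSB ordering)
        return (buf[full] & mask) == mask
    return True

def all_bits_clear(buf: bytes, value_count: int) -> bool:
    full, rem = divmod(value_count, 8)
    if any(b != 0x00 for b in buf[:full]):
        return False
    if rem:
        mask = (1 << rem) - 1
        return (buf[full] & mask) == 0
    return True
```

With these, "all values null" is all_bits_clear over the validity buffer, "contains no null" is all_bits_set, and the "any valid / any invalid" checks are their negations.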