Re: [Discuss] Do a 0.15.0 release before 1.0.0?
+1 on a 0.15.0 release. At a minimum, if we could detect the stream and provide a clear error message for Python and Java, I think that would help the transition. If we are also able to implement readers/writers that can fall back to the 4-byte prefix, that would be nice to have.

On Wed, Jul 24, 2019 at 1:27 PM Jacques Nadeau wrote:

> I'm ok with the change and 0.15 release to better manage it.
>
> > I've always understood the metadata to be a few dozen/hundred KB, a small percentage of the total message size. I could be underestimating the ratios though -- is it common to have tables w/ 1000+ columns? I've seen a few reports like that in cuDF, but I'm curious to hear Jacques'/Dremio's experience too.
>
> Metadata size has been an issue at different points for us. We do definitely see datasets with 1000+ columns. It is also compounded by the fact that as we add more columns, we typically decrease row count so that the individual batches are still easily pipelined--which further increases the relative ratio between data and metadata.
Re: [Discuss] Do a 0.15.0 release before 1.0.0?
I'm ok with the change and 0.15 release to better manage it.

> I've always understood the metadata to be a few dozen/hundred KB, a small percentage of the total message size. I could be underestimating the ratios though -- is it common to have tables w/ 1000+ columns? I've seen a few reports like that in cuDF, but I'm curious to hear Jacques'/Dremio's experience too.

Metadata size has been an issue at different points for us. We do definitely see datasets with 1000+ columns. It is also compounded by the fact that as we add more columns, we typically decrease row count so that the individual batches are still easily pipelined--which further increases the relative ratio between data and metadata.
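The batch-sizing effect described in this message can be illustrated with some rough arithmetic. The sketch below is not from the thread: the 1 MiB batch budget, the 8-byte value width, and the 64 bytes of metadata per column are made-up numbers, and `metadata_fraction` is a hypothetical helper. It only shows the trend, namely that if the data size per batch is held roughly constant, adding columns shrinks the row count and raises the share of each message taken up by metadata.

```python
def metadata_fraction(columns, batch_budget_bytes=1 << 20,
                      value_width=8, metadata_per_column=64):
    """Approximate share of a record batch message taken up by metadata.

    All defaults are illustrative assumptions, not measured Arrow figures.
    """
    # Hold the data size per batch roughly fixed, so more columns means fewer rows.
    rows = max(1, batch_budget_bytes // (columns * value_width))
    data_bytes = columns * rows * value_width
    metadata_bytes = columns * metadata_per_column
    return metadata_bytes / (metadata_bytes + data_bytes)


if __name__ == "__main__":
    # Print the metadata share for narrow, wide, and very wide tables.
    for columns in (10, 1000, 10000):
        print(columns, round(metadata_fraction(columns), 4))
```

Under these assumptions the fraction grows roughly linearly with the column count, which is consistent with the experience reported above for 1000+ column datasets.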
Re: [Discuss] Do a 0.15.0 release before 1.0.0?
> I'm not sure I understand this suggestion:
> 1. Wouldn't this cause old readers to miss the last 4 bytes of the buffer (and provide meaningless bytes at the beginning)?
> 2. The current proposal on the other thread is to have the pattern be <0x>

Sorry, I didn't mean to say an int64_t length, just that now we'd be reserving 8 bytes in the "metadata length" position where today we reserve 4.

I'm not sure about every language, but at least in Python/JS an external forwards-compatible solution would involve slicing the message buffer up front like this:

    import pyarrow as pa

    def adjust_message_buffer(message_bytes):
        # Wrap the raw bytes in an Arrow buffer.
        buf = pa.py_buffer(message_bytes)
        # first_four_bytes_are_max_int32 is a placeholder helper, not defined here.
        if first_four_bytes_are_max_int32(message_bytes):
            # Skip the new leading 4 bytes so the 4-byte length comes first.
            return buf.slice(4)
        return buf

On 7/23/19 7:31 PM, Micah Kornfield wrote:

> > Could we detect the 4-byte length, incur a penalty copying the memory to an aligned buffer, then continue consuming the stream?
>
> I think that is the plan (or at least would be my plan) if we go ahead with the change
>
> > (It's probably fine if we only write the 8-byte length, since consumers on older versions of Arrow could slice from the 4th byte before passing a buffer to the reader).
>
> I'm not sure I understand this suggestion:
> 1. Wouldn't this cause old readers to miss the last 4 bytes of the buffer (and provide meaningless bytes at the beginning)?
> 2. The current proposal on the other thread is to have the pattern be <0x>
>
> Thanks,
> Micah
>
> On Tue, Jul 23, 2019 at 11:43 AM Paul Taylor wrote:
>
> > +1 for a 0.15.0 before 1.0 if we go ahead with this.
> >
> > I'm curious to hear others' thoughts about compatibility. I think we should avoid breaking backwards compatibility if possible. It's common for apps/libs to be pinned on specific Arrow versions, and I worry it'd cause a lot of work for downstream devs to audit their tool suite for full Arrow binary compatibility (and/or require their customers to do the same).
> >
> > Could we detect the 4-byte length, incur a penalty copying the memory to an aligned buffer, then continue consuming the stream? (It's probably fine if we only write the 8-byte length, since consumers on older versions of Arrow could slice from the 4th byte before passing a buffer to the reader).
> >
> > I've always understood the metadata to be a few dozen/hundred KB, a small percentage of the total message size. I could be underestimating the ratios though -- is it common to have tables w/ 1000+ columns? I've seen a few reports like that in cuDF, but I'm curious to hear Jacques'/Dremio's experience too.
> >
> > If copying is feasible, it doesn't seem so bad a trade-off to maintain backwards-compatibility. As libraries and consumers upgrade their Arrow dependencies, the 4-byte length will be less and less common, and they'll be less likely to pay the cost.
> >
> > On 7/23/19 2:22 AM, Uwe L. Korn wrote:
> >
> > > It is also a good way to test the change in public. We don't want to adjust something like this anymore in a 1.0.0 release. Already doing this in 0.15.0 and then maybe doing adjustments due to issues that appear "in the wild" is psychologically the easier way. There is a lot of thinking of users bound with the magic 1.0, thus I would plan to minimize what is changed between 1.0 and pre-1.0. This also should save us maintainers some time as I would expect different behaviour in bug reports between 1.0 and pre-1.0 issues.
> > >
> > > Uwe
> > >
> > > On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
> > >
> > > > I think the main reason to do a release before 1.0.0 is if we want to make the change that would give a good error message for forward incompatibility (I think this could be done as 0.14.2 since it would just be clarifying an error message). Otherwise, I think including it in 1.0.0 would be fine (it's still not clear to me if there is consensus to fix the issue).
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Monday, July 22, 2019, Wes McKinney wrote:
> > > >
> > > > > I'd be satisfied with fixing the Flatbuffer alignment issue either in a 0.15.0 or 1.0.0. In the interest of expediency, though, making a 0.15.0 with this change sooner rather than later might be prudent.
> > > > >
> > > > > On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Recently we've discussed breaking the IPC format to fix a long-standing alignment issue. See this discussion:
> > > > > > https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> > > > > >
> > > > > > Should we first do a 0.15.0 in order to get those format fixes right? Once that is fine and settled we can move to the 1.0.0 release?
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
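The slicing idea in this message, combined with the "fall back to the 4-byte prefix" idea raised elsewhere in the thread, can be sketched in plain Python without pyarrow. Several things here are assumptions rather than anything the thread settles: the marker value is passed in by the caller because the thread only gives it as `<0x>`, the length is read as a little-endian int32, and `read_length_prefix` and `EXAMPLE_MARKER` are hypothetical names. The new prefix is taken to be a 4-byte marker followed by the existing 4-byte length, which is what slicing off the first four bytes implies.

```python
import struct
from typing import Tuple


def read_length_prefix(buf: bytes, marker: bytes) -> Tuple[int, int]:
    """Return (metadata_length, offset_where_metadata_starts).

    Accepts both the legacy 4-byte length prefix and a hypothetical
    8-byte prefix made of a 4-byte marker plus the 4-byte length.
    """
    if len(marker) != 4:
        raise ValueError("marker must be exactly 4 bytes")
    offset = 0
    if buf[:4] == marker:
        # New-style prefix: skip the marker, the length follows it.
        offset = 4
    if len(buf) < offset + 4:
        raise ValueError("buffer too short to hold a length prefix")
    # Assumption: the length is a little-endian signed 32-bit integer.
    (length,) = struct.unpack_from("<i", buf, offset)
    return length, offset + 4


if __name__ == "__main__":
    # Arbitrary placeholder, NOT the marker value under discussion.
    EXAMPLE_MARKER = b"\xde\xad\xbe\xef"
    legacy = struct.pack("<i", 3) + b"abc"
    print(read_length_prefix(legacy, EXAMPLE_MARKER))
    print(read_length_prefix(EXAMPLE_MARKER + legacy, EXAMPLE_MARKER))
```

This only works if the marker can never be a valid legacy length, so a real reader would need the marker to be chosen with that property in mind.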
Re: [Discuss] Do a 0.15.0 release before 1.0.0?
> Could we detect the 4-byte length, incur a penalty copying the memory to an aligned buffer, then continue consuming the stream?

I think that is the plan (or at least would be my plan) if we go ahead with the change

> (It's probably fine if we only write the 8-byte length, since consumers on older versions of Arrow could slice from the 4th byte before passing a buffer to the reader).

I'm not sure I understand this suggestion:
1. Wouldn't this cause old readers to miss the last 4 bytes of the buffer (and provide meaningless bytes at the beginning)?
2. The current proposal on the other thread is to have the pattern be <0x>

Thanks,
Micah

On Tue, Jul 23, 2019 at 11:43 AM Paul Taylor wrote:

> +1 for a 0.15.0 before 1.0 if we go ahead with this.
>
> I'm curious to hear others' thoughts about compatibility. I think we should avoid breaking backwards compatibility if possible. It's common for apps/libs to be pinned on specific Arrow versions, and I worry it'd cause a lot of work for downstream devs to audit their tool suite for full Arrow binary compatibility (and/or require their customers to do the same).
>
> Could we detect the 4-byte length, incur a penalty copying the memory to an aligned buffer, then continue consuming the stream? (It's probably fine if we only write the 8-byte length, since consumers on older versions of Arrow could slice from the 4th byte before passing a buffer to the reader).
>
> I've always understood the metadata to be a few dozen/hundred KB, a small percentage of the total message size. I could be underestimating the ratios though -- is it common to have tables w/ 1000+ columns? I've seen a few reports like that in cuDF, but I'm curious to hear Jacques'/Dremio's experience too.
>
> If copying is feasible, it doesn't seem so bad a trade-off to maintain backwards-compatibility. As libraries and consumers upgrade their Arrow dependencies, the 4-byte length will be less and less common, and they'll be less likely to pay the cost.
>
> On 7/23/19 2:22 AM, Uwe L. Korn wrote:
>
> > It is also a good way to test the change in public. We don't want to adjust something like this anymore in a 1.0.0 release. Already doing this in 0.15.0 and then maybe doing adjustments due to issues that appear "in the wild" is psychologically the easier way. There is a lot of thinking of users bound with the magic 1.0, thus I would plan to minimize what is changed between 1.0 and pre-1.0. This also should save us maintainers some time as I would expect different behaviour in bug reports between 1.0 and pre-1.0 issues.
> >
> > Uwe
> >
> > On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
> >
> > > I think the main reason to do a release before 1.0.0 is if we want to make the change that would give a good error message for forward incompatibility (I think this could be done as 0.14.2 since it would just be clarifying an error message). Otherwise, I think including it in 1.0.0 would be fine (it's still not clear to me if there is consensus to fix the issue).
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Monday, July 22, 2019, Wes McKinney wrote:
> > >
> > > > I'd be satisfied with fixing the Flatbuffer alignment issue either in a 0.15.0 or 1.0.0. In the interest of expediency, though, making a 0.15.0 with this change sooner rather than later might be prudent.
> > > >
> > > > On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Recently we've discussed breaking the IPC format to fix a long-standing alignment issue. See this discussion:
> > > > > https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> > > > >
> > > > > Should we first do a 0.15.0 in order to get those format fixes right? Once that is fine and settled we can move to the 1.0.0 release?
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
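The "detect the 4-byte length, copy to an aligned buffer, then continue" fallback discussed in this message might look roughly like the following in Python. This is a sketch under stated assumptions, not how any Arrow reader does it: the 8-byte alignment target is assumed, `aligned_copy` and `is_aligned` are hypothetical names, and `ctypes` is used only to inspect memory addresses. A real implementation would more likely rely on the library's own aligned allocator.

```python
import ctypes


def _address_of(view):
    """Return the memory address of a writable, contiguous, non-empty buffer."""
    return ctypes.addressof(ctypes.c_char.from_buffer(view))


def aligned_copy(data, alignment=8):
    """Copy `data` into fresh memory whose start address is a multiple of `alignment`."""
    if not data:
        raise ValueError("nothing to copy")
    # Over-allocate so that an aligned start address is guaranteed to exist.
    backing = bytearray(len(data) + alignment)
    offset = (-_address_of(backing)) % alignment
    view = memoryview(backing)[offset:offset + len(data)]
    view[:] = data  # this copy is the penalty paid on the fallback path
    return view


def is_aligned(view, alignment=8):
    """Check whether the buffer starts on an `alignment`-byte boundary."""
    return _address_of(view) % alignment == 0
```

The copy costs time proportional to the message size, which is why the thread frames it as a penalty paid only by streams that still use the old prefix.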
Re: [Discuss] Do a 0.15.0 release before 1.0.0?
+1 for a 0.15.0 before 1.0 if we go ahead with this.

I'm curious to hear others' thoughts about compatibility. I think we should avoid breaking backwards compatibility if possible. It's common for apps/libs to be pinned on specific Arrow versions, and I worry it'd cause a lot of work for downstream devs to audit their tool suite for full Arrow binary compatibility (and/or require their customers to do the same).

Could we detect the 4-byte length, incur a penalty copying the memory to an aligned buffer, then continue consuming the stream? (It's probably fine if we only write the 8-byte length, since consumers on older versions of Arrow could slice from the 4th byte before passing a buffer to the reader).

I've always understood the metadata to be a few dozen/hundred KB, a small percentage of the total message size. I could be underestimating the ratios though -- is it common to have tables w/ 1000+ columns? I've seen a few reports like that in cuDF, but I'm curious to hear Jacques'/Dremio's experience too.

If copying is feasible, it doesn't seem so bad a trade-off to maintain backwards-compatibility. As libraries and consumers upgrade their Arrow dependencies, the 4-byte length will be less and less common, and they'll be less likely to pay the cost.

On 7/23/19 2:22 AM, Uwe L. Korn wrote:

> It is also a good way to test the change in public. We don't want to adjust something like this anymore in a 1.0.0 release. Already doing this in 0.15.0 and then maybe doing adjustments due to issues that appear "in the wild" is psychologically the easier way. There is a lot of thinking of users bound with the magic 1.0, thus I would plan to minimize what is changed between 1.0 and pre-1.0. This also should save us maintainers some time as I would expect different behaviour in bug reports between 1.0 and pre-1.0 issues.
>
> Uwe
>
> On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
>
> > I think the main reason to do a release before 1.0.0 is if we want to make the change that would give a good error message for forward incompatibility (I think this could be done as 0.14.2 since it would just be clarifying an error message). Otherwise, I think including it in 1.0.0 would be fine (it's still not clear to me if there is consensus to fix the issue).
> >
> > Thanks,
> > Micah
> >
> > On Monday, July 22, 2019, Wes McKinney wrote:
> >
> > > I'd be satisfied with fixing the Flatbuffer alignment issue either in a 0.15.0 or 1.0.0. In the interest of expediency, though, making a 0.15.0 with this change sooner rather than later might be prudent.
> > >
> > > On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou wrote:
> > >
> > > > Hello,
> > > >
> > > > Recently we've discussed breaking the IPC format to fix a long-standing alignment issue. See this discussion:
> > > > https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> > > >
> > > > Should we first do a 0.15.0 in order to get those format fixes right? Once that is fine and settled we can move to the 1.0.0 release?
> > > >
> > > > Regards
> > > >
> > > > Antoine.
Re: [Discuss] Do a 0.15.0 release before 1.0.0?
It is also a good way to test the change in public. We don't want to adjust something like this anymore in a 1.0.0 release. Already doing this in 0.15.0 and then maybe doing adjustments due to issues that appear "in the wild" is psychologically the easier way. There is a lot of thinking of users bound with the magic 1.0, thus I would plan to minimize what is changed between 1.0 and pre-1.0. This also should save us maintainers some time as I would expect different behaviour in bug reports between 1.0 and pre-1.0 issues.

Uwe

On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:

> I think the main reason to do a release before 1.0.0 is if we want to make the change that would give a good error message for forward incompatibility (I think this could be done as 0.14.2 since it would just be clarifying an error message). Otherwise, I think including it in 1.0.0 would be fine (it's still not clear to me if there is consensus to fix the issue).
>
> Thanks,
> Micah
>
> On Monday, July 22, 2019, Wes McKinney wrote:
>
> > I'd be satisfied with fixing the Flatbuffer alignment issue either in a 0.15.0 or 1.0.0. In the interest of expediency, though, making a 0.15.0 with this change sooner rather than later might be prudent.
> >
> > On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou wrote:
> >
> > > Hello,
> > >
> > > Recently we've discussed breaking the IPC format to fix a long-standing alignment issue. See this discussion:
> > > https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> > >
> > > Should we first do a 0.15.0 in order to get those format fixes right? Once that is fine and settled we can move to the 1.0.0 release?
> > >
> > > Regards
> > >
> > > Antoine.
Re: [Discuss] Do a 0.15.0 release before 1.0.0?
I think the main reason to do a release before 1.0.0 is if we want to make the change that would give a good error message for forward incompatibility (I think this could be done as 0.14.2 since it would just be clarifying an error message). Otherwise, I think including it in 1.0.0 would be fine (it's still not clear to me if there is consensus to fix the issue).

Thanks,
Micah

On Monday, July 22, 2019, Wes McKinney wrote:

> I'd be satisfied with fixing the Flatbuffer alignment issue either in a 0.15.0 or 1.0.0. In the interest of expediency, though, making a 0.15.0 with this change sooner rather than later might be prudent.
>
> On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou wrote:
>
> > Hello,
> >
> > Recently we've discussed breaking the IPC format to fix a long-standing alignment issue. See this discussion:
> > https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> >
> > Should we first do a 0.15.0 in order to get those format fixes right? Once that is fine and settled we can move to the 1.0.0 release?
> >
> > Regards
> >
> > Antoine.
Re: [Discuss] Do a 0.15.0 release before 1.0.0?
I'd be satisfied with fixing the Flatbuffer alignment issue either in a 0.15.0 or 1.0.0. In the interest of expediency, though, making a 0.15.0 with this change sooner rather than later might be prudent.

On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou wrote:

> Hello,
>
> Recently we've discussed breaking the IPC format to fix a long-standing alignment issue. See this discussion:
> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
>
> Should we first do a 0.15.0 in order to get those format fixes right? Once that is fine and settled we can move to the 1.0.0 release?
>
> Regards
>
> Antoine.
[Discuss] Do a 0.15.0 release before 1.0.0?
Hello,

Recently we've discussed breaking the IPC format to fix a long-standing alignment issue. See this discussion:
https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E

Should we first do a 0.15.0 in order to get those format fixes right? Once that is fine and settled we can move to the 1.0.0 release?

Regards

Antoine.