Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-24 Thread Bryan Cutler
+1 on a 0.15.0 release. At a minimum, if we could detect the stream and
provide a clear error message for Python and Java, I think that would help
the transition. If we are also able to implement readers/writers that can
fall back to the 4-byte prefix, then that would be nice to have.
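
A minimal sketch of the "detect the stream and provide a clear error message" idea, written in Python. It assumes that the new format begins each message with a 4-byte 0xFFFFFFFF marker; the exact pattern is not spelled out in this thread, so `ASSUMED_MARKER` is an assumption. `UnsupportedStreamError` and `read_legacy_metadata_length` are hypothetical names, not part of any Arrow API.

```python
import struct

# Assumption: the new format starts each message with this 4-byte marker.
# The exact pattern is not spelled out in this thread.
ASSUMED_MARKER = b"\xff\xff\xff\xff"


class UnsupportedStreamError(ValueError):
    """Hypothetical error type for streams this legacy reader cannot parse."""


def read_legacy_metadata_length(prefix: bytes) -> int:
    """Parse a legacy 4-byte little-endian metadata length, or fail clearly."""
    if len(prefix) < 4:
        raise UnsupportedStreamError("stream ended before the length prefix")
    if prefix[:4] == ASSUMED_MARKER:
        # A legacy reader would otherwise decode these bytes as length -1.
        raise UnsupportedStreamError(
            "stream appears to use the newer 8-byte message prefix; "
            "upgrade the reader to consume it"
        )
    (length,) = struct.unpack("<i", prefix[:4])
    if length < 0:
        raise UnsupportedStreamError(f"invalid metadata length: {length}")
    return length
```

The point of the check is that a reader which only knows the 4-byte prefix can still tell the user why it failed, instead of surfacing a confusing negative-length or allocation error.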

On Wed, Jul 24, 2019 at 1:27 PM Jacques Nadeau  wrote:

> I'm ok with the change and 0.15 release to better manage it.
>
>
> > I've always understood the metadata to be a few dozen/hundred KB, a
> > small percentage of the total message size. I could be underestimating
> > the ratios though -- is it common to have tables w/ 1000+ columns? I've
> > seen a few reports like that in cuDF, but I'm curious to hear
> > Jacques'/Dremio's experience too.
> >
>
> Metadata size has been an issue at different points for us. We do
> definitely see datasets with 1000+ columns. It is also compounded by the
> fact that as we add more columns, we typically decrease row count so that
> the individual batches are still easily pipelined--which further increases
> the relative ratio between data and metadata.
>


Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-24 Thread Jacques Nadeau
I'm ok with the change and 0.15 release to better manage it.


> I've always understood the metadata to be a few dozen/hundred KB, a
> small percentage of the total message size. I could be underestimating
> the ratios though -- is it common to have tables w/ 1000+ columns? I've
> seen a few reports like that in cuDF, but I'm curious to hear
> Jacques'/Dremio's experience too.
>

Metadata size has been an issue at different points for us. We do
definitely see datasets with 1000+ columns. It is also compounded by the
fact that as we add more columns, we typically decrease row count so that
the individual batches are still easily pipelined--which further increases
the relative ratio between data and metadata.
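
The effect described above can be sketched with a little arithmetic. This is only an illustration: it assumes that metadata grows roughly linearly with the number of columns and that the data per batch is held to a fixed byte budget. The function name and every figure below are hypothetical, not measurements.

```python
def metadata_fraction(num_columns: int, batch_data_bytes: int,
                      metadata_bytes_per_column: int) -> float:
    """Fraction of a batch message taken by metadata under a fixed data budget."""
    metadata_bytes = num_columns * metadata_bytes_per_column
    return metadata_bytes / (metadata_bytes + batch_data_bytes)


# Hypothetical figures chosen only for illustration.
narrow = metadata_fraction(num_columns=10, batch_data_bytes=1_000_000,
                           metadata_bytes_per_column=100)
wide = metadata_fraction(num_columns=2_000, batch_data_bytes=1_000_000,
                         metadata_bytes_per_column=100)
print(f"narrow table: {narrow:.4%} metadata, wide table: {wide:.4%} metadata")
```

Under these assumptions the metadata share rises with the column count, and it rises further if the row count (and so the data budget) is reduced at the same time.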


Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-24 Thread Paul Taylor

> I'm not sure I understand this suggestion:
> 1.  Wouldn't this cause old readers to miss the last 4 bytes of the buffer
> (and provide meaningless bytes at the beginning).
> 2.  The current proposal on the other thread is to have the pattern be
> <0x>


Sorry I didn't mean to say an int64_t length, just that now we'd be 
reserving 8 bytes in the "metadata length" position where today we 
reserve 4.


I'm not sure about every language, but at least in Python/JS an external 
forwards-compatible solution would involve slicing the message buffer up 
front like this:


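import pyarrow as pa

def adjust_message_buffer(message_bytes):
  # Wrap the raw bytes in an Arrow buffer without copying.
  buf = pa.py_buffer(message_bytes)
  # first_four_bytes_are_max_int32 is a helper that is not defined here.
  if first_four_bytes_are_max_int32(message_bytes):
    # Skip the extra 4 bytes so an older reader sees its 4-byte length.
    return buf.slice(4)
  return buf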
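
For reference, a standard-library sketch of the same slicing idea that runs without pyarrow. It assumes the extra leading bytes are the 4-byte pattern 0xFFFFFFFF, which the helper name `first_four_bytes_are_max_int32` suggests but which is not stated in this thread; `ASSUMED_MARKER` and the helper body are therefore assumptions.

```python
ASSUMED_MARKER = b"\xff\xff\xff\xff"  # assumption: value not given in this thread


def first_four_bytes_are_max_int32(message_bytes) -> bool:
    """Hypothetical helper: True when the message starts with the assumed marker."""
    return bytes(message_bytes[:4]) == ASSUMED_MARKER


def adjust_message_buffer(message_bytes) -> memoryview:
    """Return a zero-copy view that skips the extra 4 bytes when present."""
    view = memoryview(message_bytes)
    if first_four_bytes_are_max_int32(message_bytes):
        return view[4:]
    return view


# Usage: both layouts end up looking like the legacy 4-byte-length layout.
new_style = ASSUMED_MARKER + b"\x08\x00\x00\x00" + b"metadata"
old_style = b"\x08\x00\x00\x00" + b"metadata"
print(bytes(adjust_message_buffer(new_style)) == old_style)
```

Slicing a `memoryview` does not copy, so this workaround costs nothing for messages that already use the old layout.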



On 7/23/19 7:31 PM, Micah Kornfield wrote:

Could we detect the 4-byte length, incur a penalty copying the memory to
an aligned buffer, then continue consuming the stream?

I think that is the plan (or at least would be my plan) if we go ahead with
the change




(It's probably
fine if we only write the 8-byte length, since consumers on older
versions of Arrow could slice from the 4th byte before passing a buffer
to the reader).

I'm not sure I understand this suggestion:
1.  Wouldn't this cause old readers to miss the last 4 bytes of the buffer
(and provide meaningless bytes at the beginning).
2.  The current proposal on the other thread is to have the pattern be
<0x>

Thanks,
Micah

On Tue, Jul 23, 2019 at 11:43 AM Paul Taylor 
wrote:


+1 for a 0.15.0 before 1.0 if we go ahead with this.

I'm curious to hear others' thoughts about compatibility. I think we
should avoid breaking backwards compatibility if possible. It's common
for apps/libs to be pinned on specific Arrow versions, and I worry it'd
cause a lot of work for downstream devs to audit their tool suite for
full Arrow binary compatibility (and/or require their customers to do
the same).

Could we detect the 4-byte length, incur a penalty copying the memory to
an aligned buffer, then continue consuming the stream? (It's probably
fine if we only write the 8-byte length, since consumers on older
versions of Arrow could slice from the 4th byte before passing a buffer
to the reader).

I've always understood the metadata to be a few dozen/hundred KB, a
small percentage of the total message size. I could be underestimating
the ratios though -- is it common to have tables w/ 1000+ columns? I've
seen a few reports like that in cuDF, but I'm curious to hear
Jacques'/Dremio's experience too.

If copying is feasible, it doesn't seem so bad a trade-off to maintain
backwards-compatibility. As libraries and consumers upgrade their Arrow
dependencies, the 4-byte length will be less and less common, and
they'll be less likely to pay the cost.



On 7/23/19 2:22 AM, Uwe L. Korn wrote:

It is also a good way to test the change in public. We don't want to
still be adjusting something like this in a 1.0.0 release. Making the
change in 0.15.0, and then adjusting for any issues that appear "in the
wild", is psychologically the easier way. Users attach a lot of
expectations to the magic 1.0, so I would plan to minimize what changes
between pre-1.0 and 1.0. This should also save us maintainers some time,
as I would expect different behaviour in bug reports between 1.0 and
pre-1.0 issues.

Uwe

On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:

I think the main reason to do a release before 1.0.0 is if we want to make
the change that would give a good error message for forward incompatibility
(I think this could be done as 0.14.2 since it would just be clarifying an
error message).  Otherwise, I think including it in 1.0.0 would be fine
(it's still not clear to me if there is consensus to fix the issue).

Thanks,
Micah


On Monday, July 22, 2019, Wes McKinney  wrote:


I'd be satisfied with fixing the Flatbuffer alignment issue either in
a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
0.15.0 with this change sooner rather than later might be prudent.

On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
wrote:

Hello,

Recently we've discussed breaking the IPC format to fix a long-standing
alignment issue.  See this discussion:


https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E

Should we first do a 0.15.0 in order to get those format fixes right?
Once that is fine and settled we can move to the 1.0.0 release?

Regards

Antoine.







Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-23 Thread Micah Kornfield
>
> Could we detect the 4-byte length, incur a penalty copying the memory to
> an aligned buffer, then continue consuming the stream?

I think that is the plan (or at least would be my plan) if we go ahead with
the change



> (It's probably
> fine if we only write the 8-byte length, since consumers on older
> versions of Arrow could slice from the 4th byte before passing a buffer
> to the reader).

I'm not sure I understand this suggestion:
1.  Wouldn't this cause old readers to miss the last 4 bytes of the buffer
(and provide meaningless bytes at the beginning).
2.  The current proposal on the other thread is to have the pattern be
<0x>

Thanks,
Micah

On Tue, Jul 23, 2019 at 11:43 AM Paul Taylor 
wrote:

> +1 for a 0.15.0 before 1.0 if we go ahead with this.
>
> I'm curious to hear others' thoughts about compatibility. I think we
> should avoid breaking backwards compatibility if possible. It's common
> for apps/libs to be pinned on specific Arrow versions, and I worry it'd
> cause a lot of work for downstream devs to audit their tool suite for
> full Arrow binary compatibility (and/or require their customers to do
> the same).
>
> Could we detect the 4-byte length, incur a penalty copying the memory to
> an aligned buffer, then continue consuming the stream? (It's probably
> fine if we only write the 8-byte length, since consumers on older
> versions of Arrow could slice from the 4th byte before passing a buffer
> to the reader).
>
> I've always understood the metadata to be a few dozen/hundred KB, a
> small percentage of the total message size. I could be underestimating
> the ratios though -- is it common to have tables w/ 1000+ columns? I've
> seen a few reports like that in cuDF, but I'm curious to hear
> Jacques'/Dremio's experience too.
>
> If copying is feasible, it doesn't seem so bad a trade-off to maintain
> backwards-compatibility. As libraries and consumers upgrade their Arrow
> dependencies, the 4-byte length will be less and less common, and
> they'll be less likely to pay the cost.
>
>
>
> On 7/23/19 2:22 AM, Uwe L. Korn wrote:
> > It is also a good way to test the change in public. We don't want to
> > still be adjusting something like this in a 1.0.0 release. Making the
> > change in 0.15.0, and then adjusting for any issues that appear "in the
> > wild", is psychologically the easier way. Users attach a lot of
> > expectations to the magic 1.0, so I would plan to minimize what changes
> > between pre-1.0 and 1.0. This should also save us maintainers some time,
> > as I would expect different behaviour in bug reports between 1.0 and
> > pre-1.0 issues.
> >
> > Uwe
> >
> > On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
> >> I think the main reason to do a release before 1.0.0 is if we want to make
> >> the change that would give a good error message for forward incompatibility
> >> (I think this could be done as 0.14.2 since it would just be clarifying an
> >> error message).  Otherwise, I think including it in 1.0.0 would be fine
> >> (it's still not clear to me if there is consensus to fix the issue).
> >>
> >> Thanks,
> >> Micah
> >>
> >>
> >> On Monday, July 22, 2019, Wes McKinney  wrote:
> >>
> >>> I'd be satisfied with fixing the Flatbuffer alignment issue either in
> >>> a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
> >>> 0.15.0 with this change sooner rather than later might be prudent.
> >>>
> >>> On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
> >>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> Recently we've discussed breaking the IPC format to fix a long-standing
> >>>> alignment issue.  See this discussion:
> >>>>
> >>>> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> >>>>
> >>>> Should we first do a 0.15.0 in order to get those format fixes right?
> >>>> Once that is fine and settled we can move to the 1.0.0 release?
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
>
>
>


Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-23 Thread Paul Taylor

+1 for a 0.15.0 before 1.0 if we go ahead with this.

I'm curious to hear others' thoughts about compatibility. I think we
should avoid breaking backwards compatibility if possible. It's common 
for apps/libs to be pinned on specific Arrow versions, and I worry it'd 
cause a lot of work for downstream devs to audit their tool suite for 
full Arrow binary compatibility (and/or require their customers to do 
the same).


Could we detect the 4-byte length, incur a penalty copying the memory to 
an aligned buffer, then continue consuming the stream? (It's probably 
fine if we only write the 8-byte length, since consumers on older  
versions of Arrow could slice from the 4th byte before passing a buffer 
to the reader).
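
A rough sketch of what "detect the 4-byte length, copy, then continue" could look like in a newer reader. It is not Arrow's implementation. It assumes the new layout is a 4-byte 0xFFFFFFFF marker followed by a 4-byte little-endian length, with the legacy layout being the 4-byte length alone; `ASSUMED_MARKER` and `read_metadata` are hypothetical names.

```python
import struct

ASSUMED_MARKER = b"\xff\xff\xff\xff"  # assumption: not specified in this thread


def read_metadata(buf):
    """Return (metadata, copied) for a message using either prefix layout.

    New layout (assumed): 4-byte marker + 4-byte length, metadata at offset 8.
    Legacy layout: 4-byte length only, metadata at offset 4.
    """
    view = memoryview(buf)
    if bytes(view[:4]) == ASSUMED_MARKER:
        (length,) = struct.unpack_from("<i", view, 4)
        # Offset 8 is a multiple of 8 relative to the message start: no copy.
        return view[8:8 + length], False
    (length,) = struct.unpack_from("<i", view, 0)
    # Legacy path: pay for a copy so the metadata starts at offset 0 of a
    # fresh buffer instead of at offset 4 of the original one.
    return bytearray(view[4:4 + length]), True


meta = b"12345678"
legacy = struct.pack("<i", len(meta)) + meta
newer = ASSUMED_MARKER + struct.pack("<i", len(meta)) + meta
print(read_metadata(legacy)[1], read_metadata(newer)[1])
```

Only legacy messages take the copying branch, which matches the argument that the cost fades as writers upgrade.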


I've always understood the metadata to be a few dozen/hundred KB, a 
small percentage of the total message size. I could be underestimating 
the ratios though -- is it common to have tables w/ 1000+ columns? I've 
seen a few reports like that in cuDF, but I'm curious to hear 
Jacques'/Dremio's experience too.


If copying is feasible, it doesn't seem so bad a trade-off to maintain 
backwards-compatibility. As libraries and consumers upgrade their Arrow 
dependencies, the 4-byte length will be less and less common, and 
they'll be less likely to pay the cost.




On 7/23/19 2:22 AM, Uwe L. Korn wrote:

It is also a good way to test the change in public. We don't want to still be adjusting
something like this in a 1.0.0 release. Making the change in 0.15.0, and then adjusting
for any issues that appear "in the wild", is psychologically the easier way. Users attach
a lot of expectations to the magic 1.0, so I would plan to minimize what changes between
pre-1.0 and 1.0. This should also save us maintainers some time, as I would expect
different behaviour in bug reports between 1.0 and pre-1.0 issues.

Uwe

On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:

I think the main reason to do a release before 1.0.0 is if we want to make
the change that would give a good error message for forward incompatibility
(I think this could be done as 0.14.2 since it would just be clarifying an
error message).  Otherwise, I think including it in 1.0.0 would be fine
(it's still not clear to me if there is consensus to fix the issue).

Thanks,
Micah


On Monday, July 22, 2019, Wes McKinney  wrote:


I'd be satisfied with fixing the Flatbuffer alignment issue either in
a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
0.15.0 with this change sooner rather than later might be prudent.

On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
wrote:


Hello,

Recently we've discussed breaking the IPC format to fix a long-standing
alignment issue.  See this discussion:


https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E

Should we first do a 0.15.0 in order to get those format fixes right?
Once that is fine and settled we can move to the 1.0.0 release?

Regards

Antoine.





Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-23 Thread Uwe L. Korn
It is also a good way to test the change in public. We don't want to still be
adjusting something like this in a 1.0.0 release. Making the change in 0.15.0,
and then adjusting for any issues that appear "in the wild", is psychologically
the easier way. Users attach a lot of expectations to the magic 1.0, so I would
plan to minimize what changes between pre-1.0 and 1.0. This should also save us
maintainers some time, as I would expect different behaviour in bug reports
between 1.0 and pre-1.0 issues.

Uwe

On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
> I think the main reason to do a release before 1.0.0 is if we want to make
> the change that would give a good error message for forward incompatibility
> (I think this could be done as 0.14.2 since it would just be clarifying an
> error message).  Otherwise, I think including it in 1.0.0 would be fine
> (it's still not clear to me if there is consensus to fix the issue).
> 
> Thanks,
> Micah
> 
> 
> On Monday, July 22, 2019, Wes McKinney  wrote:
> 
> > I'd be satisfied with fixing the Flatbuffer alignment issue either in
> > a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
> > 0.15.0 with this change sooner rather than later might be prudent.
> >
> > On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
> > wrote:
> > >
> > >
> > > Hello,
> > >
> > > Recently we've discussed breaking the IPC format to fix a long-standing
> > > alignment issue.  See this discussion:
> > >
> > https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> > >
> > > Should we first do a 0.15.0 in order to get those format fixes right?
> > > Once that is fine and settled we can move to the 1.0.0 release?
> > >
> > > Regards
> > >
> > > Antoine.
> >
>


Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-22 Thread Micah Kornfield
I think the main reason to do a release before 1.0.0 is if we want to make
the change that would give a good error message for forward incompatibility
(I think this could be done as 0.14.2 since it would just be clarifying an
error message).  Otherwise, I think including it in 1.0.0 would be fine
(it's still not clear to me if there is consensus to fix the issue).

Thanks,
Micah


On Monday, July 22, 2019, Wes McKinney  wrote:

> I'd be satisfied with fixing the Flatbuffer alignment issue either in
> a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
> 0.15.0 with this change sooner rather than later might be prudent.
>
> On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
> wrote:
> >
> >
> > Hello,
> >
> > Recently we've discussed breaking the IPC format to fix a long-standing
> > alignment issue.  See this discussion:
> >
> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> >
> > Should we first do a 0.15.0 in order to get those format fixes right?
> > Once that is fine and settled we can move to the 1.0.0 release?
> >
> > Regards
> >
> > Antoine.
>


Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-22 Thread Wes McKinney
I'd be satisfied with fixing the Flatbuffer alignment issue either in
a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
0.15.0 with this change sooner rather than later might be prudent.

On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou  wrote:
>
>
> Hello,
>
> Recently we've discussed breaking the IPC format to fix a long-standing
> alignment issue.  See this discussion:
> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
>
> Should we first do a 0.15.0 in order to get those format fixes right?
> Once that is fine and settled we can move to the 1.0.0 release?
>
> Regards
>
> Antoine.


[Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-22 Thread Antoine Pitrou


Hello,

Recently we've discussed breaking the IPC format to fix a long-standing
alignment issue.  See this discussion:
https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E

Should we first do a 0.15.0 in order to get those format fixes right?
Once that is fine and settled we can move to the 1.0.0 release?

Regards

Antoine.