Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?

2021-09-16 Thread Jacob Quinn
Good question.

In my mind, I was imagining the arrow-julia repo would have fully
decoupled versioning from the main arrow project. This comes from my
understanding that the julia implementation is its own "project" that
implements the arrow spec/format, and we may need a breaking major release
at different cadences than the main spec version. Indeed, while the arrow
project has gone from 2.0 -> 6.0 since the julia implementation was first
released, we're just now releasing our own 2.0.0 version after a change in
API for how metadata is set/retrieved on table/column objects.

I'll admit that it's not entirely clear to me how to best signal/implement
coordination between the main arrow project versions and the julia version
though. I'm just guessing here, but is that why the main arrow project does
such frequent major version releases? To account for any child
implementations happening to have breaking changes? I think I remember
discussion recently around moving the actual spec/format document out as a
separate repo or at least versioning it separately from all the various
implementations, and that seems like it would be a good idea, though I
guess the format itself has versioning built in. It's certainly
something we can clarify in the Julia package itself; i.e. which version of
the spec a given Julia package version is compatible with. Typically with
other julia package dependencies, just a minor version increment is
required when a new breaking dependency version is upgraded, so I would
think we could follow something similar by treating the arrow format as a
"dependency".
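
A rough sketch of the "format as a dependency" idea, in Python purely for
illustration (this is not Julia and not anything that exists in Arrow.jl).
The names Version, classify and next_version are hypothetical, and the
policy encoded in classify is only an assumption about how such a rule
might look: a breaking change in the package's own API is a major bump,
while adopting a new format version is a minor bump, as for any other
dependency.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A plain major.minor.patch version triple."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)


def classify(api_breaking: bool, format_upgraded: bool, new_feature: bool) -> str:
    """Hypothetical policy: decide which kind of release a change needs."""
    if api_breaking:
        return "major"
    if format_upgraded or new_feature:
        # Assumption: a newer format version is treated like any upgraded
        # dependency, so it only needs a minor bump.
        return "minor"
    return "patch"


def next_version(current: Version, change: str) -> Version:
    """Return the version that follows `current` for the given change kind."""
    if change == "major":
        return Version(current.major + 1, 0, 0)
    if change == "minor":
        return Version(current.major, current.minor + 1, 0)
    if change == "patch":
        return Version(current.major, current.minor, current.patch + 1)
    raise ValueError(f"unknown change kind: {change!r}")


# Adopting a new format version without breaking the package API.
bump = classify(api_breaking=False, format_upgraded=True, new_feature=False)
print(next_version(Version.parse("2.0.0"), bump))
```

Under this assumed policy the package's major version only moves when its
own API breaks, independent of the main arrow release numbers.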

I'll clarify that I don't feel very strongly on these points, so if there's
something I'm missing or gaps in my understanding of how the rest of the
web of projects are coordinating things, I'm all ears.

-Jacob

On Thu, Sep 16, 2021 at 11:24 PM Sutou Kouhei  wrote:

> Hi,
>
> Good point! Jacob, could you confirm this?
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?" on Sat, 11
> Sep 2021 16:57:17 -0700,
>   QP Hou  wrote:
>
> > Just one minor point to confirm and clarify. It looks like Julia arrow only
> > wants to do on demand minor and patch releases. Major version release still
> > needs to be aligned with the main arrow release schedule, is that correct?
> > In other words, breaking changes should be avoided in on demand releases
> > (assuming they are using semantic versioning).
> >
> > From the original julia donation thread, I got the impression that the
> > julia maintainers wanted to have their own versioning scheme. Maybe that’s
> > not the case anymore. So I wanted to make sure we set the right expectation
> > for Julia maintainers.
> >
> > FWIW, Arrow-rs today aligns the major version with the main arrow release,
> > so Andrew spends quite a bit of time maintaining an active release branch to
> > backport backwards compatible commits for minor and patch releases.
> > DataFusion and Ballista, on the other hand, have a versioning scheme that’s
> > fully decoupled from the main Arrow version including the major version.
> >
> > On Thu, Sep 9, 2021 at 1:38 PM Sutou Kouhei  wrote:
> >
> >> Hi,
> >>
> >> Thanks for all comments about release schedule.
> >>
> >> Let's use release-on-demand approach based on
> >> arrow-datafusion's flow for the Julia Arrow implementation.
> >>
> >> Do we have more items to be discussed? Can we start voting?
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In 
> >>   "Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?" on Thu, 9
> >> Sep 2021 09:48:57 -0400,
> >>   Andrew Lamb  wrote:
> >>
> >> > I also think release on demand is a good strategy.
> >> >
> >> > The primary reasons to do an arrow-rs release every 2 weeks were:
> >> > 1. To have predictable cadence into downstream projects (e.g. datafusion
> >> > and others)
> >> > 2. Amortize the overhead associated with each release (the process is non
> >> > trivial and the current 72 hour voting window adds some backpressure as
> >> > well -- I remember Wes may have said windows shorter than 72 hours might be
> >> > fine too)
> >> >
> >> >
> >> > On Wed, Sep 8, 2021 at 12:19 AM QP Hou wrote:
> >> >
> >> >> A minor note on the Rust side of things. arrow-rs has a 2 weeks
> >> >> release cycle, but arrow-datafusion mostly does release on demand at
> >> >> the moment. Our most uptodate release processes are documented at [1]
> >> >> and [2].
> >> >>
> >> >> [1]: https://github.com/apache/arrow-rs/blob/master/dev/release/README.md
> >> >> [2]: https://github.com/apache/arrow-datafusion/blob/master/dev/release/README.md
> >> >>
> >> >> On Tue, Sep 7, 2021 at 4:01 PM Jacob Quinn wrote:
> >> >> >
> >> >> > Thanks kou.
> >> >> >
> >> >> > I think the TODO action list looks good.
> >> >> >
> >> >> > The one point I think could use some additional discussion is
> around
> >> the
> >> >> > release cadence: it IS desirable to be able to 

Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?

2021-09-16 Thread Sutou Kouhei
Hi,

Good point! Jacob, could you confirm this?


Thanks,
-- 
kou

In 
  "Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?" on Sat, 11 Sep 
2021 16:57:17 -0700,
  QP Hou  wrote:

> Just one minor point to confirm and clarify. It looks like Julia arrow only
> wants to do on demand minor and patch releases. Major version release still
> needs to be aligned with the main arrow release schedule, is that correct?
> In other words, breaking changes should be avoided in on demand releases
> (assuming they are using semantic versioning).
> 
> From the original julia donation thread, I got the impression that the
> julia maintainers wanted to have their own versioning scheme. Maybe that’s
> not the case anymore. So I wanted to make sure we set the right expectation
> for Julia maintainers.
> 
> FWIW, Arrow-rs today aligns the major version with the main arrow release,
> so Andrew spends quite a bit of time maintaining an active release branch to
> backport backwards compatible commits for minor and patch releases.
> DataFusion and Ballista, on the other hand, have a versioning scheme that’s
> fully decoupled from the main Arrow version including the major version.
> 
> On Thu, Sep 9, 2021 at 1:38 PM Sutou Kouhei  wrote:
> 
>> Hi,
>>
>> Thanks for all comments about release schedule.
>>
>> Let's use release-on-demand approach based on
>> arrow-datafusion's flow for the Julia Arrow implementation.
>>
>> Do we have more items to be discussed? Can we start voting?
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In 
>>   "Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?" on Thu, 9
>> Sep 2021 09:48:57 -0400,
>>   Andrew Lamb  wrote:
>>
>> > I also think release on demand is a good strategy.
>> >
>> > The primary reasons to do an arrow-rs release every 2 weeks were:
>> > 1. To have predictable cadence into downstream projects (e.g. datafusion
>> > and others)
>> > 2. Amortize the overhead associated with each release (the process is non
>> > trivial and the current 72 hour voting window adds some backpressure as
>> > well -- I remember Wes may have said windows shorter than 72 hours might be
>> > fine too)
>> >
>> >
>> > On Wed, Sep 8, 2021 at 12:19 AM QP Hou  wrote:
>> >
>> >> A minor note on the Rust side of things. arrow-rs has a 2 weeks
>> >> release cycle, but arrow-datafusion mostly does release on demand at
>> >> the moment. Our most uptodate release processes are documented at [1]
>> >> and [2].
>> >>
>> >> [1]: https://github.com/apache/arrow-rs/blob/master/dev/release/README.md
>> >> [2]: https://github.com/apache/arrow-datafusion/blob/master/dev/release/README.md
>> >>
>> >> On Tue, Sep 7, 2021 at 4:01 PM Jacob Quinn wrote:
>> >> >
>> >> > Thanks kou.
>> >> >
>> >> > I think the TODO action list looks good.
>> >> >
>> >> > The one point I think could use some additional discussion is around the
>> >> > release cadence: it IS desirable to be able to release more frequently than
>> >> > the parent repo 3-4 month cadence. But we also haven't had the frequency of
>> >> > commits to necessarily warrant a release every 2 weeks. I can think of two
>> >> > possible options, not sure if one or the other would be more compatible
>> >> > with the apache release process:
>> >> >
>> >> > 1) Allow for release-on-demand; this is idiomatic for most Julia packages
>> >> > I'm aware of. When a particular bug is fixed, or feature added, a user can
>> >> > request a release, a little discussion happens, and a new release is made.
>> >> > This approach would work well for the "bursty" kind of contributions we've
>> >> > seen to Arrow.jl where development by certain people will happen frequently
>> >> > for a while, then take a break for other things. This also avoids having
>> >> > "scheduled" releases (every 2 weeks, 3 months, etc.) where there hasn't
>> >> > been significant updates to necessarily warrant a new release. This
>> >> > approach may also facilitate differentiating between bugfix (patch)
>> >> > releases vs. new functionality releases (minor), since when a release is
>> >> > requested, it could be specified whether it should be patch or minor (or
>> >> > major).
>> >> >
>> >> > 2) Commit to a scheduled release pattern like every 2 weeks, once a month,
>> >> > etc. This has the advantage of consistency and clearer expectations for
>> >> > users/devs involved. A release also doesn't need to be requested, because
>> >> > we can just wait for the scheduled time to release. In terms of the
>> >> > "unnecessary releases" mentioned above, it could be as simple as
>> >> > "cancelling" a release if there hasn't been significant updates in the
>> >> > elapsed time period.
>> >> >
>> >> > My preference would be for 1), but that's influenced from what I'm familiar
>> >> > with in the Julia package ecosystem. It seems like it would still fit in
>> >> > the apache way since we would formally request a new release, wait 

Re: [DISCUSS][Rust] Biweekly sync call for arrow/datafusion again?

2021-09-16 Thread QP Hou
I would be interested in meeting with more contributors "face to face"
and chiming in to help move these major initiatives forward in any way I
can :)

--
QP

On Thu, Sep 16, 2021 at 6:55 AM Rémi Dettai  wrote:
>
> I am also very interested in reinstating these events, at least
> occasionally.
>
> I do think that sharing some higher level goals and ideas in more *informal*
> discussions could help us understand each other better in our asynchronous
> work (design documents, issues, PRs).
>
> I also agree that no decision should be taken during these calls. An
> interesting format could be that each time, one or two participants, on a
> volunteering basis, share a small presentation of their work on/around
> Arrow/Datafusion, the time they have available to spend on it, maybe their
> overall vision of what they would like the project to become...
>
> Remi
>
> Le jeu. 16 sept. 2021 à 15:37, Andrew Lamb  a écrit :
>
> > A lot has been happening in DataFusion and Arrow since we stopped the
> > Rust specific sync calls (see mailing list thread [1] on the topic).
> >
> > I would like to gauge interest in restarting the calls.
> >
> > I think a call could be valuable to:
> > 1. Help "put a face to the name" of some of the other contributors we are
> > working with
> > 2. Discuss / synchronize on the goals and major initiatives from different
> > stakeholders to identify areas where more alignment is needed
> >
> > Recent areas I am thinking about that might benefit from some in person
> > discussion are the object store API [2] and table provider splits [3].
> >
> > As always, we would ensure that minutes are sent out, no decisions are
> > made on the call, and anything of substance is discussed on this mailing
> > list or in github issues / google docs.
> >
> > Andrew
> >
> > [1]
> >
> > https://lists.apache.org/thread.html/rbeadc3b11bce8731c69617c8e0fe780a97055de0fcd739c378d9c0e1%40%3Cdev.arrow.apache.org%3E
> >
> > [2] https://github.com/apache/arrow-datafusion/pull/950
> >
> > [3] https://github.com/apache/arrow-datafusion/issues/1009
> >


Re: [DISCUSS] Leap seconds/days and day light saving for Duration types

2021-09-16 Thread QP Hou
Thank you for your feedback Weston and Antoine. I agree that ordering
discussion should be out of scope for the Arrow format spec. I have
removed the reference to ordering in the PR, so now the only change is
mentioning leap seconds to keep it consistent with other temporal
types.
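
For reference, the leap-second correction that the Schema.fbs comment on
the Time type describes (quoted further down in this thread) can be
sketched as follows. This is only an illustration in Python under the
assumption of a seconds-resolution time of day; clamp_leap_second is a
hypothetical helper, not part of any Arrow implementation.

```python
SECONDS_PER_DAY = 86_400


def clamp_leap_second(seconds_since_midnight: int) -> int:
    """Map a leap-second reading (86400) onto the last valid second (86399).

    Values 0..86399 pass through unchanged; anything else is rejected.
    """
    if not 0 <= seconds_since_midnight <= SECONDS_PER_DAY:
        raise ValueError(
            f"time of day out of range: {seconds_since_midnight}"
        )
    return min(seconds_since_midnight, SECONDS_PER_DAY - 1)


# 23:59:60 (a leap second) is folded onto 23:59:59 at ingestion time.
print(clamp_leap_second(86_400))
```

The correction is lossy: after ingestion, a leap-second reading can no
longer be told apart from the second that precedes it.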

I would like to add that even though we are not explicitly discussing
ordering in the spec, any kind of restriction we assign to a type
would still implicitly impact ordering in downstream compute kernels.
This is why I also took out the discussion of leap days in my PR.
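
To illustrate how a fixed-length convention implicitly fixes an ordering,
here is a small Python sketch of the scheme Weston describes further down
(30-day months, 365-day years, 24-hour days). The Interval tuple and
nominal_days are hypothetical names for this example only; this is not how
postgres or any Arrow compute kernel is implemented.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A calendar interval with independent year, month and day fields."""
    years: int = 0
    months: int = 0
    days: int = 0


def nominal_days(interval: Interval) -> int:
    """Sort key assuming 365-day years and 30-day months."""
    return interval.years * 365 + interval.months * 30 + interval.days


intervals = [
    Interval(days=366),
    Interval(years=1),
    Interval(months=12),
    Interval(days=359),
    Interval(days=365),
    Interval(days=360),
]

# sorted() is stable, so intervals with equal keys keep their input order.
ordered = sorted(intervals, key=nominal_days)
for interval in ordered:
    print(interval, nominal_days(interval))
```

Under these assumptions 12 months (360 nominal days) sorts before 1 year
(365 nominal days), which is the kind of surprising result that such a
convention bakes into any kernel that adopts it.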

Thanks,
QP

On Tue, Sep 14, 2021 at 12:46 AM Antoine Pitrou  wrote:
>
>
> I agree with Weston that ordering isn't in the scope for the Arrow
> format spec (*).  For example, implementations are free to define UTF8
> comparisons and ordering as they wish (some may want to invest in the
> complexity of the official Unicode collation algorithm, others may be
> content with a simple codepoint-wise lexicographic comparison).  It
> doesn't prevent them from exchanging UTF8 data unambiguously using Arrow.
>
> (*) It may be in the scope for a hypothetical Compute IR spec, however.
>
> Regards
>
> Antoine.
>
>
> Le 14/09/2021 à 07:16, QP Hou a écrit :
> > Good point Weston. My proposal was written with the impression that
> > Arrow does want to define semantics for some of these temporal types
> > based on the existing comments in the Schema.fbs file.
> >
> > For example, here is a quote taken from the comments for the Time type:
> >
> > /// This definition doesn't allow for leap seconds. Time values from
> > /// measurements with leap seconds will need to be corrected when ingesting
> > /// into Arrow (for example by replacing the value 86400 with 86399).
> >
> > Here is another quote for the Date type:
> >
> > /// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
> > /// leap seconds), where the values are evenly divisible by 86400000
> >
> > For the interval type, we have:
> >
> > // A "calendar" interval which models types that don't necessarily
> > // have a precise duration without the context of a base timestamp (e.g.
> > // days can differ in length during day light savings time transitions).
> >
> > I think pushing the responsibility to define these semantics to the
> > data producer side is also a perfectly fine design with its own
> > trade-offs. It would make data exchange between two different systems
> > a little bit harder because consumers need to be aware of the
> > semantics defined by the producer. On the other hand, it does make the
> > producer implementation easier. It also makes data exchange within the
> > same system more efficient if that system's temporal type semantic is
> > different from what's defined in Arrow's spec.
> >
> > Either way, I think it would be good if we can be consistent on our
> > temporal type semantics in the spec. If we are making the claim that
> > leap seconds should not be taken into account for Time, Timestamp and
> > Date types, then it seems natural to make this claim for Interval type
> > as well. Alternatively, we could update the spec to make all temporal
> > types leap-second agnostic.
> >
> > On Mon, Sep 13, 2021 at 12:03 PM Weston Pace  wrote:
> >>
> >> One could define a sorting based on 30-day months, 365-day years, and
> >> 24-hour days.  It would be consistent but can lead to some surprising
> >> results.  It appears that this is what postgres does as I got the
> >> following ordering for an interval:
> >>
> >> 359 days, 12 months, 360 days, 1 year, 365 days, 366 days
> >>
> >> On the other hand, Joda time forbids comparison of periods (their
> >> version of what we call an interval) and offers three ways to convert
> >> to a duration.  There is toDurationFrom(instant),
> >> toDurationTo(instant) which give durations from specific calendar
> >> ranges and then there is toStandardDuration() which converts to a
> >> duration based on 24 hour days.  However, toStandardDuration will
> >> still fail if the period has >0 months or years (presumably because
> >> months and years are too inconsistent).
> >>
> >> I'm not sure though that this is something that Arrow needs to define.
> >> We aren't specifying any invalid ranges of values.  I don't foresee
> >> any interoperability concerns.  A system that treated intervals as
> >> comparable (and didn't factor in DST, leap years, etc.) will read and
> >> write intervals the same way as a system that considers intervals
> >> incomparable.
> >>
> >> This question seems to fall into the "compute" space inhabited by
> >> topics like "is 'false && null' a false value or a null value" and
> >> "should addition overflow or throw an exception".
> >>
> >> On Mon, Sep 13, 2021 at 6:23 AM QP Hou  wrote:
> >>>
> >>> On Mon, Sep 13, 2021 at 6:18 AM Antoine Pitrou  wrote:
> > >>>> The Duration type is defined with a TimeUnit.  You are probably thinking
> > >>>> about the Interval type.
> > >>>>
> >>>
> >>> Oops, my bad, yes, it should be Interval 

Re: [DISCUSS][Rust] Biweekly sync call for arrow/datafusion again?

2021-09-16 Thread Rémi Dettai
I am also very interested in reinstating these events, at least
occasionally.

I do think that sharing some higher level goals and ideas in more *informal*
discussions could help us understand each other better in our asynchronous
work (design documents, issues, PRs).

I also agree that no decision should be taken during these calls. An
interesting format could be that each time, one or two participants, on a
volunteering basis, share a small presentation of their work on/around
Arrow/Datafusion, the time they have available to spend on it, maybe their
overall vision of what they would like the project to become...

Remi

Le jeu. 16 sept. 2021 à 15:37, Andrew Lamb  a écrit :

> A lot has been happening in DataFusion and Arrow since we stopped the
> Rust specific sync calls (see mailing list thread [1] on the topic).
>
> I would like to gauge interest in restarting the calls.
>
> I think a call could be valuable to:
> 1. Help "put a face to the name" of some of the other contributors we are
> working with
> 2. Discuss / synchronize on the goals and major initiatives from different
> stakeholders to identify areas where more alignment is needed
>
> Recent areas I am thinking about that might benefit from some in person
> discussion are the object store API [2] and table provider splits [3].
>
> As always, we would ensure that minutes are sent out, no decisions are
> made on the call, and anything of substance is discussed on this mailing
> list or in github issues / google docs.
>
> Andrew
>
> [1]
>
> https://lists.apache.org/thread.html/rbeadc3b11bce8731c69617c8e0fe780a97055de0fcd739c378d9c0e1%40%3Cdev.arrow.apache.org%3E
>
> [2] https://github.com/apache/arrow-datafusion/pull/950
>
> [3] https://github.com/apache/arrow-datafusion/issues/1009
>


[DISCUSS][Rust] Biweekly sync call for arrow/datafusion again?

2021-09-16 Thread Andrew Lamb
A lot has been happening in DataFusion and Arrow since we stopped the
Rust specific sync calls (see mailing list thread [1] on the topic).

I would like to gauge interest in restarting the calls.

I think a call could be valuable to:
1. Help "put a face to the name" of some of the other contributors we are
working with
2. Discuss / synchronize on the goals and major initiatives from different
stakeholders to identify areas where more alignment is needed

Recent areas I am thinking about that might benefit from some in person
discussion are the object store API [2] and table provider splits [3].

As always, we would ensure that minutes are sent out, no decisions are
made on the call, and anything of substance is discussed on this mailing
list or in github issues / google docs.

Andrew

[1]
https://lists.apache.org/thread.html/rbeadc3b11bce8731c69617c8e0fe780a97055de0fcd739c378d9c0e1%40%3Cdev.arrow.apache.org%3E

[2] https://github.com/apache/arrow-datafusion/pull/950

[3] https://github.com/apache/arrow-datafusion/issues/1009