[Format] Timestamp timezone semantics?

2021-06-02 Thread Antoine Pitrou



Hello,

For the first time I notice this piece of information about the 
timestamp type:



  /// * If the time zone is set to a valid value, values can be displayed as
  ///   "localized" to that time zone, even though the underlying 64-bit
  ///   integers are identical to the same data stored in UTC. Converting
  ///   between time zones is a metadata-only operation and does not change the
  ///   underlying values


(from https://github.com/apache/arrow/blob/master/format/Schema.fbs#L223 )

This seems rather weird to me: timestamps always convey a UTC timestamp 
value, optionally decorated with a local timezone?  What is the 
motivation for such a representation?  It is unlike other systems such 
as Python, where a timezone-aware timestamp really expresses a local 
time value, not a UTC time value.


Thank you,

Antoine.


Re: [Format] Timestamp timezone semantics?

2021-06-02 Thread Joris Van den Bossche
On Wed, 2 Jun 2021 at 13:56, Antoine Pitrou  wrote:

>
> Hello,
>
> For the first time I notice this piece of information about the
> timestamp type:
>
> >   /// * If the time zone is set to a valid value, values can be displayed
> >   ///   as "localized" to that time zone, even though the underlying 64-bit
> >   ///   integers are identical to the same data stored in UTC. Converting
> >   ///   between time zones is a metadata-only operation and does not change
> >   ///   the underlying values
>
> (from https://github.com/apache/arrow/blob/master/format/Schema.fbs#L223 )
>
> This seems rather weird to me: timestamps always convey a UTC timestamp
> value, optionally decorated with a local timezone?  What is the
> motivation for such a representation?  It is unlike other systems such
> as Python, where a timezone-aware timestamp really expresses a local
> time value, not a UTC time value.
>

Just as a reference: pandas uses the same model of storing UTC timestamps
for timezone-aware data (I think numpy also stored values as UTC, before
timezone support was removed). And databases like PostgreSQL store
timestamps as UTC internally as well, AFAIK.
The Python standard library's datetime.datetime indeed stores localized
timestamps. But an important difference is that Python stores the
year/month/day/hour/etc. as separate values, so it directly represents an
actual moment in time in a certain timezone. What we store, on the other
hand, is "unix time" (an offset from January 1st, 1970 at UTC); I am not
sure how you would store a timestamp in a certain timezone in this model.

Some advantages of storing UTC that come to mind: it makes converting from
one timezone to another a trivial (metadata-only) operation, it makes
timestamp comparisons across timezones easier, and it makes timedelta
arithmetic easier.
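
A rough pyarrow sketch of this model (the printed values in the comments
are illustrative): the stored values are plain int64 epoch offsets, and
converting between time zones is a cast that only rewrites the type
metadata.

import pyarrow as pa

# One int64 value: 0 seconds since the epoch, tagged as UTC.
utc = pa.array([0], type=pa.timestamp("s", tz="UTC"))

# "Convert" to New York time: a metadata-only cast, buffers untouched.
local = utc.cast(pa.timestamp("s", tz="America/New_York"))

print(utc[0].as_py())    # 1970-01-01 00:00:00+00:00
print(local[0].as_py())  # 1969-12-31 19:00:00-05:00 -- the same instant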

Joris


> Thank you,
>
> Antoine.
>


Re: [Format] Timestamp timezone semantics?

2021-06-02 Thread Antoine Pitrou



On 02/06/2021 at 14:58, Joris Van den Bossche wrote:

On Wed, 2 Jun 2021 at 13:56, Antoine Pitrou  wrote:



Hello,

For the first time I notice this piece of information about the
timestamp type:


   /// * If the time zone is set to a valid value, values can be displayed as
   ///   "localized" to that time zone, even though the underlying 64-bit
   ///   integers are identical to the same data stored in UTC. Converting
   ///   between time zones is a metadata-only operation and does not change
   ///   the underlying values


(from https://github.com/apache/arrow/blob/master/format/Schema.fbs#L223 )

This seems rather weird to me: timestamps always convey a UTC timestamp
value, optionally decorated with a local timezone?  What is the
motivation for such a representation?  It is unlike other systems such
as Python, where a timezone-aware timestamp really expresses a local
time value, not a UTC time value.



Just as a reference: pandas uses the same model of storing UTC timestamps
for timezone-aware data (I think numpy also stored values as UTC, before
timezone support was removed). And databases like PostgreSQL store
timestamps as UTC internally as well, AFAIK.
The Python standard library's datetime.datetime indeed stores localized
timestamps. But an important difference is that Python stores the
year/month/day/hour/etc. as separate values, so it directly represents an
actual moment in time in a certain timezone. What we store, on the other
hand, is "unix time" (an offset from January 1st, 1970 at UTC); I am not
sure how you would store a timestamp in a certain timezone in this model.


Ah, my bad. I was under the (apparently mistaken) impression that Arrow 
was the exception here.


Regards

Antoine.


Re: [Format] Timestamp timezone semantics?

2021-06-02 Thread Joris Peeters
You could store epoch offsets, but interpret them in the local timezone.
E.g. (0, "America/New_York") could mean 1970-01-01 00:00:00 in the New York
timezone.
At least one nasty problem with that is ambiguous times, i.e. when the
clock turns back on going from DST to ST, as well as invalid times (when
the clock moves forwards, meaning some epoch offsets never occur).
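
A minimal illustration of both pitfalls, using the standard library's
zoneinfo (Python 3.9+; the dates are ordinary US DST transitions):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")

# Ambiguous: 01:30 on 2021-11-07 occurs twice in New York (clocks fall
# back); "fold" selects which occurrence is meant.
first = datetime(2021, 11, 7, 1, 30, tzinfo=ny, fold=0)
second = datetime(2021, 11, 7, 1, 30, tzinfo=ny, fold=1)
print(first.astimezone(timezone.utc))   # 2021-11-07 05:30:00+00:00
print(second.astimezone(timezone.utc))  # 2021-11-07 06:30:00+00:00

# Nonexistent: 02:30 on 2021-03-14 was skipped (clocks spring forward),
# yet Python constructs it without complaint.
ghost = datetime(2021, 3, 14, 2, 30, tzinfo=ny)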



Re: [Flight Extension] Request for Comments

2021-06-02 Thread Nate Bauernfeind
The thread isn't stale, and this is an appropriate question.

Caveat: I have not yet finished applying the feedback from this thread, so
some of what I say below is not yet reflected in the OSS offering (nor in
the existing main branch of the barrage repo).

IMO there are two kinds of listener patterns:

1) The listener who wants to listen straight from Arrow. This listener
opens an Arrow Flight DoExchange to initiate the subscription (the details
of the subscription are stored in a flatbuffer type, SubscriptionRequest,
encoded in the app_metadata of the client-sent FlightData). An update is a
set of sequential RecordBatches. The first RecordBatch of an update carries
an app_metadata flatbuffer, BarrageUpdateMetadata, which includes
information such as which rows were added, modified, or removed. This
metadata also includes the number of add record batches and the number of
mod record batches. First the add batches come, then the mod batches. In
aggregate, this set of record batches represents a full update, so the
listener receives the batches and the metadata for the update all at once
(see the sketch after this list).

2) The listener who wants a shared object that maintains the data in the
subscription (which might be a subscription on the whole table) and
provides a lighter callback that only describes which rows changed (not the
data, since the shared object can be queried). This pattern is ideal if you
have multiple listeners for the same set of data.
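
A hedged sketch of pattern #1 with pyarrow.flight. The pyarrow calls are
real, but the endpoint, schema, and metadata payloads are placeholders,
not the actual Barrage flatbuffer encoding:

import pyarrow as pa
import pyarrow.flight as flight

client = flight.FlightClient("grpc://localhost:8080")  # hypothetical server
descriptor = flight.FlightDescriptor.for_command(b"barrage")
writer, reader = client.do_exchange(descriptor)

# Subscription details ride in the app_metadata of the first client-sent
# FlightData; a real client would put SubscriptionRequest flatbuffer
# bytes here.
writer.begin(pa.schema([("placeholder", pa.int64())]))
writer.write_metadata(pa.py_buffer(b"<SubscriptionRequest bytes>"))

try:
    while True:
        chunk = reader.read_chunk()  # FlightStreamChunk
        if chunk.app_metadata is not None:
            # First batch of an update: BarrageUpdateMetadata, describing
            # added/modified/removed rows and how many add/mod batches follow.
            pass
        # chunk.data holds the RecordBatch payload of add/mod batches.
except StopIteration:
    pass  # server closed the stream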

Our Java client adopted approach #2; our OSS offering contains a Java
worker process that executes table operations. The table operations are
applied iteratively using an update mechanism very similar to the IPC
format. Our Java client implementation allows you to pump the results of a
subscription into the local query engine's update mechanism, which
effectively chains multiple workers together.

Here is that implementation; note that it is in Java, doesn't technically
use the official Arrow Flight implementation, and doesn't reflect all of
the feedback we would like to apply.
https://github.com/nbauernfeind/deephaven-core/blob/doput/grpc-api-client/src/main/java/io/deephaven/grpc_api_client/table/BarrageSourcedTable.java

Our query engine listener interface is here:
https://github.com/deephaven/deephaven-core/blob/main/DB/src/main/java/io/deephaven/db/v2/ShiftAwareListener.java.
Again, it is very similar to the IPC format, just without the accompanying
data; the actual row data is intended to be accessed via other APIs.

For our C++ client, we are planning on stopping at the BarrageSourcedTable
equivalent. Our users could choose between the Arrow Flight stream or the
slightly more language-friendly version that maintains the view. However,
if they want to do any client-side analysis of the data, they are on their
own (no filtering, no aggregations, no ticking aside from the gRPC
subscription, etc.).

Nate

P.S. While the deephaven-core repo is technically public, it is relatively
young and will move through a few API-breaking changes over the next few
months. (For example, applying the feedback from earlier in the thread will
break some of what exists today.)

On Tue, Jun 1, 2021 at 9:43 PM Paul Whalen  wrote:

> Hopefully this thread isn't too stale to pick back up with an open ended
> question.  What interface would a Barrage client library expose?  With
> Flight, application code cares about RecordBatches, but with Barrage it
> seems as though a client library ought to handle the updating of the table
> and expose that updated view to a client application.  But what
> specifically would that view be?
>
> In the last few months I've built out some Flight services that would
> benefit from a protocol like Barrage, and it renewed my interest enough to
> casually start a Go implementation based on Nate's documentation, just as a
> way of wrapping my head around the problem.  I was watching the repo Nate
> shared which ultimately led to the Java implementation embedded in
> Deephaven's open source offering, but since that is part of a larger
> application, it's a little hard to tell where the lines would be drawn.
>
> Paul
>
> On Tue, Mar 9, 2021 at 9:45 PM Micah Kornfield 
> wrote:
>
> > >
> > > As for schema evolution, I agree with what Micah proposes as a first
> > step.
> > > That would again add some overhead, perhaps. As for feasibility, at
> least
> > > on the C++/Python side, I think there would be a decent amount of
> > > refactoring needed, and there's also the question of how to expose this
> > in
> > > the API - the APIs there are based on reader/writer interfaces that
> don't
> > > expose schema evolution.
> >
> > One more option that might be too slow, is if a schema change is
> necessary,
> > a new flight endpoint is communicated and a new RPC is used?  (reusing
> the
> > same underlying channel could mitigate some performance issues here).
> >
> > On Tue, Mar 9, 2021 at 3:17 PM David Li  wrote:
> >
> > > There's not really any convention for the app_metadat

Re: [Format] Timestamp timezone semantics?

2021-06-02 Thread Adam Hooper
On Wed, Jun 2, 2021 at 7:56 AM Antoine Pitrou  wrote:

>
> This seems rather weird to me: timestamps always convey a UTC timestamp
> value, optionally decorated with a local timezone?  What is the
> motivation for such a representation?  It is unlike other systems such
> as Python


It's standard. I think the motivation is: "local timestamps are the worst
things in computing." (Oh no, here comes a rant!)

SQL gets timestamps completely wrong. MySQL, PostgreSQL, MSSQL and Oracle
all use similar words ("timestamp", "datetime", etc.) to mean different
things. Depending on the RDBMS, you need to think in five timezones --
server timezone, client timezone, database timezone, database-column
timezone and cell timezone. The syntax and semantics are different in all
database engines. (Personally, I always wince at Postgres' "TIMESTAMP WITH
TIMEZONE": it's the best practice because *it doesn't store a timezone*.
All RDBMSs are similarly absurd; props to MySQL for being slightly less
nonsensical than the rest.)

Python is based on C, and C has an obsession with "local time". What an
awful relic. Python `datetime` deals in wildly inefficient 9-tuples, not
integers; and it happily stores and represents nonexistent times such as
`datetime.datetime(2018, 3, 11, 2, 30,
tzinfo=zoneinfo.ZoneInfo(key='US/Eastern'))`. Python's `time` module gets
you into C-land and integers; there, timezone-aware math only works in the
"local timezone", a global variable read from os.environ["TZ"] and cached
elsewhere in the module.

Local times are *hard to compare* (they jump around daylight savings);
they're *hard to validate* (some don't exist, others are ambiguous); and
they *cannot store future times* (future timezones are yet to be decreed by
politicians).

Don't follow in C's or SQL's footsteps. Store timestamps as integer UTC
timestamps. Store the timezone somewhere else; use it to convert to local
time when formatting, and to convert to a calendar for calendar math.
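
Concretely, a stdlib-only sketch of that recommendation (the epoch value
and zone are arbitrary):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

epoch_s = 1_622_635_200      # the integer UTC timestamp you store
zone = "America/New_York"    # stored separately, as metadata

# Apply the zone only at the edge, when formatting for humans.
instant = datetime.fromtimestamp(epoch_s, tz=timezone.utc)
print(instant.astimezone(ZoneInfo(zone)))  # 2021-06-02 08:00:00-04:00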

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com


Apache Arrow Rust Sync Call 6/2/2021

2021-06-02 Thread Andy Grove
Attendees

   - Benjamin Blodgett
   - Andy Grove
   - Jorn Horstmann
   - Andrew Lamb
   - Jorge Leitao

Topics Discussed

   - There were no agenda items raised.


Re: [Format] Timestamp timezone semantics?

2021-06-02 Thread Rok Mihevc
On Wed, Jun 2, 2021 at 3:23 PM Joris Peeters 
wrote:

> You could store epoch offsets, but interpret them in the local timezone.
> E.g. (0, "America/New_York") could mean 1970-01-01 00:00:00 in the New York
> timezone.
> At least one nasty problem with that is ambiguous times, i.e. when the
> clock turns back on going from DST to ST, as well as invalid times (when
> the clock moves forwards, meaning some epoch offsets never occur).
>

Another problem is that calendars change (see Adam's points), so the offset
would not be constant.


Re: [Format] Timestamp timezone semantics?

2021-06-02 Thread Julian Hyde
Good time libraries support all of these. E.g. Joda-Time [1] has:

* Instant - an instantaneous point on the time-line
* DateTime - full date and time with time-zone
* LocalDateTime - date-time without a time-zone

The SQL world isn't quite as much of a mess as Adam makes it out to
be. The SQL standard defines TIMESTAMP, DATE and TIME as zoneless
(like Joda's LocalDateTime) and most DBs have types that behave in
that way. Often those DBs also have types that behave like Instant and
DateTime (but naming is a little inconsistent).

I recommend that Arrow support all three. Choose clear, distinct names for
all three, consistent with names used elsewhere in the industry.

Any SQL interface to Arrow should follow the SQL standard. So, for
instance, if a column has TIMESTAMP type, it should behave as a
date-time without a time-zone.

Julian

[1] https://www.joda.org/joda-time/
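
For reference, here is how Arrow's existing timestamp type lines up with
that taxonomy, expressed as pyarrow constructors (the Joda analogy is
approximate):

import pyarrow as pa

pa.timestamp("us", tz="UTC")               # Instant-like: UTC epoch values
pa.timestamp("us", tz="America/New_York")  # DateTime-ish: instant plus a display zone
pa.timestamp("us")                         # LocalDateTime-like: zoneless wall-clock time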



Re: C++ Migrate from Arrow 0.16.0

2021-06-02 Thread Rares Vernica
Thanks for the pointers! The migration is going well.

We have been using the Arrow 0.16.0 RecordBatchStreamWriter
(https://github.com/Paradigm4/bridge/blob/master/src/PhysicalXSave.cpp#L450)
with & without CompressedOutputStream and wrote the resulting Arrow Buffer
data to S3 or the file system. We have a sizable amount of data saved this
way.

Once we upgrade our C++ code to use Arrow 3.0.0 or 4.0.0, will it be
possible to read the Arrow stream files written with Arrow 0.16.0?

Thank you!
Rares

On Thu, May 27, 2021 at 1:44 PM Benjamin Kietzman 
wrote:

> Yes this is an adaptation of ARROW_ASSIGN_OR_RAISE for
> their bridge, which seems to throw exceptions instead of returning
> Status/Result
>
> On Thu, May 27, 2021 at 4:42 PM Micah Kornfield 
> wrote:
>
> > For the macro, I believe ARROW_ASSIGN_OR_RAISE already does this?
> >
> > On Thu, May 27, 2021 at 1:19 PM Benjamin Kietzman 
> > wrote:
> >
> > > unique_ptr is used to designate unique ownership of the buffer
> > > just created. It's fairly compatible with shared_ptr since
> > > unique_ptr can convert implicitly to shared_ptr.
> > >
> > > One other refactoring in play here: we've been moving from
> > > Status-returning-out-argument functions to the more ergonomic
> > > Result. I'd recommend you write a new macro for dealing with
> > > Results, like:
> > >
> > > #define CONCAT_IMPL(x, y) x##y
> > > #define CONCAT(x, y) CONCAT_IMPL(x, y)
> > > #define ASSIGN_OR_THROW_IMPL(result_name, lhs, rexpr) \
> > >   auto&& result_name = (rexpr); \
> > >   THROW_NOT_OK((result_name).status()); \
> > >   lhs = std::move(result_name).ValueUnsafe();
> > > // CONCAT indirection so __COUNTER__ expands before token pasting,
> > > // giving each expansion a unique temporary name.
> > > #define ASSIGN_OR_THROW(lhs, rexpr) \
> > >   ASSIGN_OR_THROW_IMPL(CONCAT(_maybe, __COUNTER__), lhs, rexpr)
> > >
> > > Then lines such as
> > > https://github.com/Paradigm4/bridge/blob/master/src/Driver.h#L196
> > > can be rewritten as:
> > >
> > > ASSIGN_OR_THROW(buffer, arrow::AllocateBuffer(length));
> > >
> > > Does that help?
> > >
> > > On Thu, May 27, 2021 at 3:47 PM Rares Vernica 
> > wrote:
> > >
> > > > Hello,
> > > >
> > > > We are trying to migrate from Arrow 0.16.0 to a newer version, hopefully
> > > > up to 4.0.0. The Arrow 0.17.0 change in AllocateBuffer from taking a
> > > > shared_ptr<Buffer> out-argument to returning a unique_ptr<Buffer> is
> > > > making things very difficult. We wonder if there is a strong reason
> > > > behind the change from shared_ptr to unique_ptr and if there is an
> > > > easier path forward for us.
> > > >
> > > > In our code, we interchangeably use Buffer and ResizableBuffer. We pass
> > > > these pointers around across a number of classes. They are allocated or
> > > > resized here:
> > > > https://github.com/Paradigm4/bridge/blob/master/src/Driver.h#L191
> > > > Moreover, we cast the ResizableBuffer instance to Buffer in order to have
> > > > all our methods deal only with Buffer, here:
> > > > https://github.com/Paradigm4/bridge/blob/master/src/Driver.h#L151
> > > >
> > > > In Arrow 0.16.0 AllocateBuffer took a shared_ptr<Buffer> out-argument and
> > > > this works fine. In Arrow 0.17.0 AllocateBuffer returns a
> > > > unique_ptr<Buffer>. Our cast from ResizableBuffer to Buffer won't work on
> > > > unique_ptr and we won't be able to pass the Buffer around so easily.
> > > >
> > > > I noticed that there is another AllocateBuffer in MemoryManager that
> > > > returns a shared_ptr<Buffer>:
> > > > https://arrow.apache.org/docs/cpp/api/memory.html?highlight=resizablebuffer#_CPPv4N5arrow13MemoryManager14AllocateBufferE7int64_t
> > > > Is this a better alternative to allocate a buffer? Is there a similar
> > > > method to allocate a resizable buffer?
> > > >
> > > > Thank you,
> > > > Rares
> > > >
> > >
> >
>


Re: C++ Migrate from Arrow 0.16.0

2021-06-02 Thread Antoine Pitrou



On 02/06/2021 at 21:57, Rares Vernica wrote:

Thanks for the pointers! The migration is going well.

We have been using the Arrow 0.16.0 RecordBatchStreamWriter
(https://github.com/Paradigm4/bridge/blob/master/src/PhysicalXSave.cpp#L450)
with & without CompressedOutputStream and wrote the resulting Arrow Buffer
data to S3 or the file system. We have a sizable amount of data saved this
way.

Once we upgrade our C++ code to use Arrow 3.0.0 or 4.0.0, will it be
possible to read the Arrow stream files written with Arrow 0.16.0?


It definitely should; if it doesn't, that's a bug.
That said, for extra safety, I suggest you test loading the files before 
doing the final migration.
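
A quick way to run that check with current pyarrow ("old_data.arrows" is a
hypothetical path to one of the 0.16.0-era stream files):

import pyarrow as pa

with pa.OSFile("old_data.arrows", "rb") as f:
    table = pa.ipc.open_stream(f).read_all()
print(table.schema)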


By the way, for saving lots of data to S3, it may be more efficient to 
use Parquet. It will be more CPU-intensive but will result in 
significant space savings.
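
For example (a minimal sketch; the codec choice is illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
pq.write_table(table, "data.parquet", compression="zstd")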


Regards

Antoine.


Re: [Format] Timestamp timezone semantics?

2021-06-02 Thread Micah Kornfield
>
> Any SQL interface to Arrow should follow the SQL standard. So, for
> instance, if a column has TIMESTAMP type, it should behave as a
> date-time without a time-zone.


At least in BigQuery we do the following mapping:
SQL TIMESTAMP -> Arrow Timestamp with "UTC" timezone
SQL DATETIME -> Arrow Timestamp without a time-zone.



[ANNOUNCE] New Arrow committer: Dominik Moritz

2021-06-02 Thread Wes McKinney
On behalf of the Arrow PMC, I'm happy to announce that Dominik has accepted an
invitation to become a committer on Apache Arrow. Welcome, and thank you
for your contributions!

Wes


[C++] Async Arrow Flight

2021-06-02 Thread Nate Bauernfeind
It seems to me that the c++ arrow flight implementation uses only the
synchronous version of the gRPC API. gRPC supports asynchronous message
delivery in C++ via a CompletionQueue that must be polled. Has there been
any desire to standardize on a solution for asynchronous use cases, perhaps
delivered via a provided CompletionQueue?

For a simple async grpc c++ example you can look here:
https://github.com/grpc/grpc/blob/master/examples/cpp/helloworld/greeter_async_client.cc

Thanks,
Nate



Re: [ANNOUNCE] New Arrow committer: Dominik Moritz

2021-06-02 Thread Dominik Moritz
Thank you for the warm welcome, Wes.

I look forward to continuing to work with you all on Arrow, and in
particular on the Arrow JavaScript library.

Dominik



Re: [Format] Timestamp timezone semantics?

2021-06-02 Thread Julian Hyde


> On Jun 2, 2021, at 1:56 PM, Micah Kornfield  wrote:
> 
> 
> At least in bigquery we do the following mapping:
> SQL TIMESTAMP -> Arrow Timestamp with "UTC" timezone
> SQL DATETIME -> Arrow Timestamp without a time-zone.

BigQuery was one of the systems I had in mind when I said "naming is a little
inconsistent". BigQuery does have a type consistent with the SQL-standard
TIMESTAMP type, but it’s called DATETIME. The TIMESTAMP type is something else.

I can literally count the number of hours and dollars that have been wasted 
because my colleagues assumed that BigQuery’s TIMESTAMP type would have the 
same semantics as TIMESTAMP in other databases.

Julian

Re: [ANNOUNCE] New Arrow committer: Dominik Moritz

2021-06-02 Thread Brian Hulette
Congratulations Dominik! Well deserved!

Really excited to see some momentum in the JavaScript library



Re: [ANNOUNCE] New Arrow committer: Dominik Moritz

2021-06-02 Thread Neal Richardson
Congratulations!

Neal



Re: [ANNOUNCE] New Arrow committer: Dominik Moritz

2021-06-02 Thread Micah Kornfield
Congrats!



Re: [ANNOUNCE] New Arrow committer: Dominik Moritz

2021-06-02 Thread Rok Mihevc
Congrats Dominik!



Re: [C++] Async Arrow Flight

2021-06-02 Thread David Li
Hey Nate,

I think there's an open JIRA for something like this. I'd love to have
something that plays nicely with asyncio/trio in Python and is hopefully more
efficient. (I think it would also let us finally have per-message timeouts
instead of only a per-call deadline.) There are some challenges, though: for
example, we wouldn't expose gRPC's event loop directly, so that we could
support other transports, but that leaves more things to design. I also
recall the async C++ APIs being very underdocumented; I get the sense that
they aren't actually used except to improve some benchmarks. I'll note, for
instance, that gRPC in Python, which offers async support, uses the "core"
APIs directly and doesn't use anything the C++ layer offers.

But long story short, if you're interested in this, I think it would be a
useful addition. What sorts of things would it enable for you?

-David


Re: [ANNOUNCE] New Arrow committer: Dominik Moritz

2021-06-02 Thread David Li
Congratulations Dominik!

-David

> 

Re: C++ Migrate from Arrow 0.16.0

2021-06-02 Thread Micah Kornfield
I think the one place where it might break is for Union types (I seem to
recall a breaking change just prior to 1.0).

On Wed, Jun 2, 2021 at 1:00 PM Antoine Pitrou  wrote:

> > Once we upgrade our C++ code to use Arrow 3.0.0 or 4.0.0, will it be
> > possible to read the Arrow stream files written with Arrow 0.16.0?
>
> It definitely should; if it doesn't, that's a bug.
> That said, for extra safety, I suggest you test loading the files before
> doing the final migration.
>
> Regards
>
> Antoine.
>