Re: Authentication Redesign

2020-08-20 Thread Ryan Murray
Hey James,

Looks like the doc is not entirely readable. I am getting:

'
File is in owner's trash
You will soon permanently lose access to this file. For continued access,
please make a copy.
'

when I access it.

Best,

Ryan Murray  | OSS Engineer

+447540852009 | rym...@dremio.com



On Wed, Aug 19, 2020 at 9:43 PM James Duong  wrote:

> Hi Arrow-Dev,
>
> I've written up a proposal to make some enhancements to the authentication
> process. I've captured the overall goals on JIRA here:
> https://issues.apache.org/jira/browse/ARROW-9804
>
> The proposal itself is in this Google Doc:
>
> https://docs.google.com/document/d/1TutHf9ttw3rLzBvYxJXP2q6fZHYmROu-gyKfniOntQY/edit#
>
> The goals are primarily to decouple authentication from the handshake
> process and integrate more easily with HTTP authentication mechanisms such
> as OAuth 2.0.
>
> There are suggestions to add some simple session-related communication
> (session properties, server capabilities) as part of the proposal.
>
> I've written a POC of the proposed changes here for authentication only
> (not the handshake):
> https://github.com/apache/arrow/pull/7994
>
> Thanks!
>
> --
>
> *James Duong*
> Lead Software Developer
> Bit Quill Technologies Inc.
> Direct: +1.604.562.6082 | jam...@bitquilltech.com
> https://www.bitquilltech.com
>
> This email message is for the sole use of the intended recipient(s) and may
> contain confidential and privileged information.  Any unauthorized review,
> use, disclosure, or distribution is prohibited.  If you are not the
> intended recipient, please contact the sender by reply email and destroy
> all copies of the original message.  Thank you.
>
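(For anyone skimming the archive: the goals quoted above amount to attaching credentials as ordinary call headers rather than relying on the Handshake RPC. Below is a minimal Java sketch of that idea using the client middleware hooks that already exist in flight-core; the header name, token value, host, and port are illustrative assumptions only, not the API proposed in the doc or the POC PR.)

import org.apache.arrow.flight.CallHeaders;
import org.apache.arrow.flight.CallInfo;
import org.apache.arrow.flight.CallStatus;
import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.FlightClientMiddleware;
import org.apache.arrow.flight.Location;
import org.apache.arrow.memory.RootAllocator;

public class HeaderAuthSketch {
  public static void main(String[] args) throws Exception {
    // Illustrative token; in practice this would come from an OAuth 2.0 flow.
    final String token = "my-oauth2-access-token";

    // Client middleware that attaches an Authorization header to every call,
    // so no Handshake round trip is needed to establish identity.
    FlightClientMiddleware.Factory bearer = (CallInfo info) -> new FlightClientMiddleware() {
      @Override
      public void onBeforeSendingHeaders(CallHeaders outgoingHeaders) {
        outgoingHeaders.insert("authorization", "Bearer " + token);
      }
      @Override
      public void onHeadersReceived(CallHeaders incomingHeaders) {
        // A server could return a refreshed or session token here.
      }
      @Override
      public void onCallCompleted(CallStatus status) {
      }
    };

    try (RootAllocator allocator = new RootAllocator();
         FlightClient client = FlightClient.builder(allocator,
             Location.forGrpcInsecure("localhost", 47470)).intercept(bearer).build()) {
      // Every RPC issued through this client now carries the bearer token.
    }
  }
}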


Re: [VOTE] Release Apache Arrow 1.0.0 - RC2

2020-07-24 Thread Ryan Murray
Hey Krisztian,

I agree. I ran it on LLVM/Clang 9 and 10 and it failed both times. Given
that it succeeded for everyone else and there are a few tickets open around
LLVM > 8, I suspect it is Clang. I am currently running with LLVM 8 to see if
it passes. Do we have documentation anywhere that lists Gandiva's LLVM/Clang
requirements?

Best

Ryan

On Fri, Jul 24, 2020 at 1:18 PM Krisztián Szűcs 
wrote:

> On Tue, Jul 21, 2020 at 3:57 PM Ryan Murray  wrote:
> >
> > +0 (non-binding)
> >
> >
> > I verified source, release, binaries, integration tests for Python, C++,
> > Java. All went fine except for a failed test in C++ Gandiva: [  FAILED  ]
> > TestProjector.TestDateTime
>
> It's hard to evaluate without any context.
>
> I executed this specific test case locally and it passed.
> Based on the rest of the votes, I don't consider it a blocker.
>
> >
> >
> > Not sure if this is known or expected?
> >
> >
> > On Tue, Jul 21, 2020 at 1:32 PM Andy Grove 
> wrote:
> >
> > > +1 (binding) on testing the Rust implementation only.
> > >
> > > I did notice that the release script is not updating all the versions
> > > correctly and I filed a JIRA [1].
> > >
> > > This shouldn't prevent the release though since this one version
> number can
> > > be updated manually when we publish the crates.
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-9537
> > >
> > > On Mon, Jul 20, 2020 at 8:08 PM Krisztián Szűcs <
> szucs.kriszt...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to propose the following release candidate (RC2) of
> Apache
> > > > Arrow version 1.0.0. This is a release consisting of 838
> > > > resolved JIRA issues[1].
> > > >
> > > > This release candidate is based on commit:
> > > > b0d623957db820de4f1ff0a5ebd3e888194a48f0 [2]
> > > >
> > > > The source release rc2 is hosted at [3].
> > > > The binary artifacts are hosted at [4][5][6][7].
> > > > The changelog is located at [8].
> > > >
> > > > Please download, verify checksums and signatures, run the unit tests,
> > > > and vote on the release. See [9] for how to validate a release
> candidate.
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 Release this as Apache Arrow 1.0.0
> > > > [ ] +0
> > > > [ ] -1 Do not release this as Apache Arrow 1.0.0 because...
> > > >
> > > > [1]:
> > > >
> > >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%201.0.0
> > > > [2]:
> > > >
> > >
> https://github.com/apache/arrow/tree/b0d623957db820de4f1ff0a5ebd3e888194a48f0
> > > > [3]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-1.0.0-rc2
> > > > [4]: https://bintray.com/apache/arrow/centos-rc/1.0.0-rc2
> > > > [5]: https://bintray.com/apache/arrow/debian-rc/1.0.0-rc2
> > > > [6]: https://bintray.com/apache/arrow/python-rc/1.0.0-rc2
> > > > [7]: https://bintray.com/apache/arrow/ubuntu-rc/1.0.0-rc2
> > > > [8]:
> > > >
> > >
> https://github.com/apache/arrow/blob/b0d623957db820de4f1ff0a5ebd3e888194a48f0/CHANGELOG.md
> > > > [9]:
> > > >
> > >
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > > >
> > >
>


Re: [VOTE] Release Apache Arrow 1.0.0 - RC2

2020-07-21 Thread Ryan Murray
+0 (non-binding)


I verified source, release, binaries, integration tests for Python, C++,
Java. All went fine except for a failed test in C++ Gandiva: [  FAILED  ]
TestProjector.TestDateTime


Not sure if this is known or expected?


On Tue, Jul 21, 2020 at 1:32 PM Andy Grove  wrote:

> +1 (binding) on testing the Rust implementation only.
>
> I did notice that the release script is not updating all the versions
> correctly and I filed a JIRA [1].
>
> This shouldn't prevent the release though since this one version number can
> be updated manually when we publish the crates.
>
> [1] https://issues.apache.org/jira/browse/ARROW-9537
>
> On Mon, Jul 20, 2020 at 8:08 PM Krisztián Szűcs  >
> wrote:
>
> > Hi,
> >
> > I would like to propose the following release candidate (RC2) of Apache
> > Arrow version 1.0.0. This is a release consisting of 838
> > resolved JIRA issues[1].
> >
> > This release candidate is based on commit:
> > b0d623957db820de4f1ff0a5ebd3e888194a48f0 [2]
> >
> > The source release rc2 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7].
> > The changelog is located at [8].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [9] for how to validate a release candidate.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 1.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 1.0.0 because...
> >
> > [1]:
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%201.0.0
> > [2]:
> >
> https://github.com/apache/arrow/tree/b0d623957db820de4f1ff0a5ebd3e888194a48f0
> > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-1.0.0-rc2
> > [4]: https://bintray.com/apache/arrow/centos-rc/1.0.0-rc2
> > [5]: https://bintray.com/apache/arrow/debian-rc/1.0.0-rc2
> > [6]: https://bintray.com/apache/arrow/python-rc/1.0.0-rc2
> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/1.0.0-rc2
> > [8]:
> >
> https://github.com/apache/arrow/blob/b0d623957db820de4f1ff0a5ebd3e888194a48f0/CHANGELOG.md
> > [9]:
> >
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> >
>


Re: [VOTE] Release Apache Arrow 1.0.0 - RC1

2020-07-17 Thread Ryan Murray
I've tested Java and it looks good. However, the verify script keeps
bailing with protobuf-related errors:
'cpp/build/orc_ep-prefix/src/orc_ep-build/c++/src/orc_proto.pb.cc' and
friends can't find protobuf definitions. A bit odd, as CMake can see the
protobuf headers and builds directly off master work just fine. Has anyone
else experienced this? I am on Ubuntu 18.04.

On Fri, Jul 17, 2020 at 10:49 AM Antoine Pitrou  wrote:

>
> +1 (binding).  I tested on Ubuntu 18.04.
>
> * Wheels verification went fine.
> * Source verification went fine with CUDA enabled and
> TEST_INTEGRATION_JS=0 TEST_JS=0.
>
> I didn't test the binaries.
>
> Regards
>
> Antoine.
>
>
> > On 17/07/2020 at 03:41, Krisztián Szűcs wrote:
> > Hi,
> >
> > I would like to propose the second release candidate (RC1) of Apache
> > Arrow version 1.0.0.
> > This is a major release consisting of 826 resolved JIRA issues[1].
> >
> > The verification of the first release candidate (RC0) has failed [0], and
> > the packaging scripts were unable to produce two wheels. Compared
> > to RC0 this release candidate includes additional patches for the
> > following bugs: ARROW-9506, ARROW-9504, ARROW-9497,
> > ARROW-9500, ARROW-9499.
> >
> > This release candidate is based on commit:
> > bc0649541859095ee77d03a7b891ea8d6e2fd641 [2]
> >
> > The source release rc1 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7].
> > The changelog is located at [8].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [9] for how to validate a release candidate.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 1.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 1.0.0 because...
> >
> > [0]: https://github.com/apache/arrow/pull/7778#issuecomment-659065370
> > [1]:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%201.0.0
> > [2]:
> https://github.com/apache/arrow/tree/bc0649541859095ee77d03a7b891ea8d6e2fd641
> > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-1.0.0-rc1
> > [4]: https://bintray.com/apache/arrow/centos-rc/1.0.0-rc1
> > [5]: https://bintray.com/apache/arrow/debian-rc/1.0.0-rc1
> > [6]: https://bintray.com/apache/arrow/python-rc/1.0.0-rc1
> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/1.0.0-rc1
> > [8]:
> https://github.com/apache/arrow/blob/bc0649541859095ee77d03a7b891ea8d6e2fd641/CHANGELOG.md
> > [9]:
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> >
>


Re: Timeline for next major Arrow release (1.0.0)

2020-07-09 Thread Ryan Murray
I just submitted https://github.com/apache/arrow/pull/7697 to update the docs
for Java type support for 1.0.0. The docs are probably not a blocker, but it
should be an easy review.

On Thu, Jul 9, 2020 at 7:19 PM Wes McKinney  wrote:

> hi David -- I agree with you, I just set ARROW-9265 to a Blocker and
> will spend some time reviewing today. We also need to validate and
> document the expected workflow for PySpark users who upgrade to
> pyarrow=1.x and need to run in a Spark cluster that's using an older
> Arrow Java library
>
> On Thu, Jul 9, 2020 at 12:44 PM David Li  wrote:
> >
> > I would submit that https://issues.apache.org/jira/browse/ARROW-9265
> > and https://issues.apache.org/jira/browse/ARROW-9362 should also be
> > considered blockers, or users are going to have a hard time migrating.
> > Also, without these PRs, no implementation currently actually writes
> > V5 metadata by default.
> >
> > Best,
> > David
> >
> > On 7/9/20, Neal Richardson  wrote:
> > > Hi all,
> > > Following up on this. Are we still on track to cut a release candidate
> on
> > > Monday?
> > >
> > > Some specific calls to action:
> > >
> > > * Looking at the release wiki page (
> > > https://cwiki.apache.org/confluence/display/ARROW/Arrow+1.0.0+Release
> ),
> > > there are still 58 unstarted issues tagged for 1.0. Half of those are
> Rust
> > > issues that haven't been updated in over a month, so I suspect they
> aren't
> > > going to get resolved this week. Please do review the backlog and bump
> to
> > > 2.0 anything that isn't going to happen now.
> > >
> > > * There are also 5 issues listed as blockers. It's not clear to me that
> > > they're necessarily release-blocking or worth holding this release for,
> > > but if you feel strongly that they are and need more time to get them
> > > done, please speak up.
> > >
> > > * It appears that the various format changes from the past couple of
> weeks
> > > have gone through, and remaining tasks are not labeled as blockers. If
> > > there is something that you think *should* be considered a blocker,
> please
> > > indicate it as one.
> > >
> > > Neal
> > >
> > > On Thu, Jul 2, 2020 at 1:09 PM Wes McKinney 
> wrote:
> > >
> > >> hi folks,
> > >>
> > >> I hope you and your families are all well.
> > >>
> > >> We're heading into a holiday weekend here in the US -- I would guess
> > >> given the state of the backlog and nightly builds that the earliest we
> > >> could contemplate making the release will be the week of July 13. That
> > >> should give enough time next week to resolve the code changes related
> > >> to the Format votes under way along with other things that come up.
> > >>
> > >> In the meantime, if all stakeholders could please review the 1.0.0
> > >> backlog and remove issues that you do not believe will be completed in
> > >> the next 10 days with > 0.5 probability, that would be very helpful to
> > >> know where things stand vis-à-vis cutting an RC.
> > >>
> > >> Thank you,
> > >> Wes
> > >>
> > >> On Mon, Jun 15, 2020 at 11:21 AM Wes McKinney 
> > >> wrote:
> > >> >
> > >> > hi folks,
> > >> >
> > >> > Based on the previous discussions about release timelines, the
> window
> > >> > for the next major release would be around the week of July 6. Does
> > >> > this sound reasonable?
> > >> >
> > >> > I see that Neal has created a wiki page to help track the burndown
> > >> >
> > >> >
> https://cwiki.apache.org/confluence/display/ARROW/Arrow+1.0.0+Release
> > >> >
> > >> > There are currently 214 issues in the 1.0.0 backlog. Some of these
> are
> > >> > indeed blockers based on the criteria we've indicated for making the
> > >> > 1.0.0 release. Would project stakeholders please review their parts
> of
> > >> > the backlog and remove issues that aren't likely to be completed in
> > >> > the next 21 days?
> > >> >
> > >> > Thanks,
> > >> > Wes
> > >>
> > >
>


Re: Union integration test

2020-07-07 Thread Ryan Murray
sigh...thanks Antoine, that did the trick. That's what I get for
programming pre-coffee ;-)

On Tue, Jul 7, 2020 at 10:31 AM Antoine Pitrou  wrote:

>
> Hello Ryan
>
> On 07/07/2020 at 11:21, Ryan Murray wrote:
> >
> > While trying to finish up the Union integration test between Java & C++
> > https://github.com/apache/arrow/pull/7290 I came across a minor problem
> > that I hope I could get some clarification on.
> >
> > I have the integration test for C++ -> Java passing but Java -> C++ is
> > giving me trouble. The check here:
> >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L368
> > fails
> > as the Union buffers have no validity bitmap. Skipping this check for
> > Unions fixes the issue and integration tests pass.
>
> This check fails because you are sending a non-zero null count.  Since
> union arrays have no validity bitmap, they should have a zero null count
> (their child arrays, however, /can/ have embedded nulls).
>
> Regards
>
> Antoine.
>


Union integration test

2020-07-07 Thread Ryan Murray
Hey all,


While trying to finish up the Union integration test between Java & C++
https://github.com/apache/arrow/pull/7290 I came across a minor problem
that I hope I could get some clarification on.


I have the integration test for C++ -> Java passing but Java -> C++ is
giving me trouble. The check here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L368
fails
as the Union buffers have no validity bitmap. Skipping this check for
Unions fixes the issue and integration tests pass.


I want to make sure that skipping the check is the correct path, as I am
not super familiar with the C++ code. So:


1) Is skipping the above check the correct solution? (I assume not, otherwise
C++ -> C++ integration tests wouldn't pass.)

2) Is it more likely that Java is not writing the correct IPC message?

3) Is the IPC reader not initialising buffers correctly for Unions?

4) Is this a problem with MetadataVersion? I notice that neither Java nor
C++ cares about MetadataVersion::V5 yet, nor does either have much logic to
read IPC based on MetadataVersion.


I am keen to have full C++/Java compatibility out for v1.0.0 and I would be
grateful for any guidance from experts here.


Best,

Ryan


Re: [VOTE] Removing validity bitmap from Arrow union types

2020-06-30 Thread Ryan Murray
+1 (non binding)


On Tue, Jun 30, 2020 at 5:29 AM Ben Kietzman 
wrote:

> +1 (non binding)
>
> On Tue, Jun 30, 2020, 00:24 Wes McKinney  wrote:
>
> > +1 (binding)
> >
> > On Mon, Jun 29, 2020 at 11:09 PM Micah Kornfield 
> > wrote:
> > >
> > > +1 (binding) (I had a couple of nits on language, that I put in the PR
> > >
> > > On Mon, Jun 29, 2020 at 2:24 PM Wes McKinney 
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > As discussed on the mailing list [1], it has been proposed to remove
> > > > the validity bitmap buffer from Union types in the columnar format
> > > > specification and instead let value validity be determined
> exclusively
> > > > by constituent arrays of the union.
> > > >
> > > > One of the primary motivations for this is to simplify the creation
> of
> > > > unions, since constructing a validity bitmap that merges the
> > > > information contained in the child arrays' bitmaps is quite
> > > > complicated.
> > > >
> > > > Note that this change breaks IPC forward compatibility for union types;
> > > > however, implementations with hitherto spec-compliant union
> > > > implementations would be able to (at their discretion, of course)
> > > > preserve backward compatibility for deserializing "old" union data in
> > > > the case that the parent null count of the union is zero. The
> expected
> > > > impact of this breakage is low, particularly given that Unions have
> > > > been absent from integration testing and thus not recommended for
> > > > anything but ephemeral serialization.
> > > >
> > > > Under the assumption that the MetadataVersion V4 -> V5 version bump
> is
> > > > accepted, in order to protect against forward compatibility problems,
> > > > Arrow implementations would be forbidden from serializing union types
> > > > using the MetadataVersion::V4.
> > > >
> > > > A PR with the changes to Columnar.rst is at [2].
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 Accept changes to Columnar.rst (removing union validity
> bitmaps)
> > > > [ ] +0
> > > > [ ] -1 Do not accept changes because...
> > > >
> > > > [1]:
> > > >
> >
> https://lists.apache.org/thread.html/r889d7532cf1e1eff74b072b4e642762ad39f4008caccef5ecde5b26e%40%3Cdev.arrow.apache.org%3E
> > > > [2]: https://github.com/apache/arrow/pull/7535
> > > >
> >
>


Re: Missing artifact io.netty:netty-transport-native-unix-common:jar

2020-06-23 Thread Ryan Murray
Hey Talal,

Try adding the `kr.motd.maven` extension as per
https://github.com/rymurr/flight-spark-source/blob/master/pom.xml#L34

That should populate the ${os.detected.arch} variable required to pull in
the netty native dependencies.

Best,

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

<https://dremio.com>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>


On Tue, Jun 23, 2020 at 12:44 PM Talal Zahid  wrote:

> Hello,
> I have been trying to use apache arrow flight. But I am having trouble
> using it.
> It gives me following error:
>
> Missing artifact io.netty:netty-transport-native-unix-common:jar:${os.detected.name}-${os.detected.arch}:4.1.27.Final pom.xml
>
> <dependency>
>   <groupId>org.apache.arrow</groupId>
>   <artifactId>arrow-flight</artifactId>
>   <version>0.15.1</version>
> </dependency>
>
> Thank you.
>
> Regards,
> Talal Zahid
>


[jira] [Created] (ARROW-9016) [Java] Remove direct references to Netty/Unsafe Allocators

2020-06-02 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-9016:
--

 Summary: [Java] Remove direct references to Netty/Unsafe Allocators
 Key: ARROW-9016
 URL: https://issues.apache.org/jira/browse/ARROW-9016
 Project: Apache Arrow
  Issue Type: Task
Reporter: Ryan Murray


As part of ARROW-8230, this removes direct references to the Netty and Unsafe
allocation managers in `DefaultAllocationManagerOption`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9015) [Java] Make BaseBuffer package private

2020-06-02 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-9015:
--

 Summary: [Java] Make BaseBuffer package private
 Key: ARROW-9015
 URL: https://issues.apache.org/jira/browse/ARROW-9015
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray


As part of the netty work in ARROW-8230 it became clear that BaseAllocator 
should be package private



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8948) [Java][Integration] enable duplicate field names integration tests

2020-05-26 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-8948:
--

 Summary: [Java][Integration] enable duplicate field names 
integration tests
 Key: ARROW-8948
 URL: https://issues.apache.org/jira/browse/ARROW-8948
 Project: Apache Arrow
  Issue Type: Bug
  Components: Integration, Java
Reporter: Ryan Murray
Assignee: Ryan Murray






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8947) [Java] MapWithOrdinal javadoc doesn't describe actual behaviour

2020-05-26 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-8947:
--

 Summary: [Java] MapWithOrdinal javadoc doesn't describe actual 
behaviour
 Key: ARROW-8947
 URL: https://issues.apache.org/jira/browse/ARROW-8947
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray


MapWithOrdinal states that ordinals are recycled when keys are removed; it does
not currently do this and grows unbounded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Arrow Flight connector for SQL Server

2020-05-22 Thread Ryan Murray
Hey Brendan,

As Jacques promised, here are a few things to act as pointers for your work
on Flight:
* Our early-release Flight connector [1]. This fully supports single Flight
streams and partially supports parallel streams.
* I also have a Spark DataSourceV2 client which may be of interest to you [2].

Both links make use of the 'doAction' part of the Flight API spec [3] to
negotiate parallel vs. single streams, among other things. However, this is
done in an ad-hoc manner, and finding a way to standardise it for the exchange
of metadata, catalog info, connection parameters, etc. is, for me, an important
next step towards making a Flight-based protocol that is equivalent to
ODBC/JDBC. I would be happy to discuss further if you have any thoughts on
the topic.

Best,
Ryan

[1] https://github.com/dremio-hub/dremio-flight-connector
[2] https://github.com/rymurr/flight-spark-source
[3] https://github.com/apache/arrow/blob/master/format/Flight.proto
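For reference, here is a minimal Java sketch of the kind of ad-hoc doAction exchange described above; the action type string, payload, hostname, and port are assumptions for illustration rather than any standardised protocol:

import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import org.apache.arrow.flight.Action;
import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.Location;
import org.apache.arrow.flight.Result;
import org.apache.arrow.memory.RootAllocator;

public class DoActionSketch {
  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator();
         FlightClient client = FlightClient.builder(allocator,
             Location.forGrpcInsecure("flight-coordinator", 47470)).build()) {
      // "PARALLEL" is a made-up action type: the body carries whatever ad-hoc
      // metadata the client and server have agreed on out of band.
      Action action = new Action("PARALLEL", "my-dataset".getBytes(StandardCharsets.UTF_8));
      Iterator<Result> results = client.doAction(action);
      while (results.hasNext()) {
        Result result = results.next();
        System.out.println(new String(result.getBody(), StandardCharsets.UTF_8));
      }
    }
  }
}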

On Thu, May 21, 2020 at 3:08 PM Uwe L. Korn  wrote:

> Hello Brendan,
>
> welcome to the community. In addition to the folks at Dremio, I wanted to
> make you aware of the Python ODBC client library
> https://github.com/blue-yonder/turbodbc which provides a high-performance
> ODBC<->Arrow adapter. It is especially popular with MS SQL Server users as
> the fastest known way to retrieve query results as DataFrames in Python
> from SQL Server, considerably faster than pandas.read_sql or using pyodbc
> directly.
>
> While being the fastest known, I can tell that there is still a lot of CPU
> time spent in the ODBC driver "transforming" results so that it matches the
> ODBC interface. At least here, one could possibly get a lot better
> performance when retrieving large columnar results from SQL Server by
> going through Arrow Flight as an interface instead of being constrained to
> the less efficient ODBC for this use case. Currently there is a performance
> difference of 50x between reading the data from a Parquet file and reading
> the same data from a table in SQL Server (simple SELECT, no filtering or
> so). As the client CPU is at 100% for nearly the full retrieval time, using
> a more efficient protocol for data transfer could roughly translate into
> a 10x speedup.
>
> Best,
> Uwe
>
> On Wed, May 20, 2020, at 12:16 AM, Brendan Niebruegge wrote:
> > Hi everyone,
> >
> > I wanted to informally introduce myself. My name is Brendan Niebruegge,
> > I'm a Software Engineer in our SQL Server extensibility team here at
> > Microsoft. I am leading an effort to explore how we could integrate
> > Arrow Flight with SQL Server. We think this could be a very interesting
> > integration that would both benefit SQL Server and the Arrow community.
> > We are very early in our thoughts so I thought it best to reach out
> > here and see if you had any thoughts or suggestions for me. What would
> > be the best way to socialize my thoughts to date? I am keen to learn
> > and deepen my knowledge of Arrow as well so please let me know how I
> > can be of help to the community.
> >
> > Please feel free to reach out anytime (email:brn...@microsoft.com)
> >
> > Thanks,
> > Brendan Niebruegge
> >
> >
>


Re: Sparse Union format

2020-05-19 Thread Ryan Murray
Thanks for the clarification! Next time I will read the whole document ;-)

On Tue, May 19, 2020 at 2:38 PM Antoine Pitrou  wrote:

>
> As explained in the comment below:
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L91
>
> Regards
>
> Antoine.
>
>
> > On 19/05/2020 at 14:14, Ryan Murray wrote:
> > Thanks Antoine,
> >
> > Can you just clarify what you mean by 'type ids are logical'? In my mind
> > type ids are strongly coupled to the types and their order in Schema.fbs
> > [1]. Do you mean that the order there is only a convention and we can't
> > assume that 0 === Null?
> >
> > Best,
> > Ryan
> >
> > [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L235
> >
> > On Tue, May 19, 2020 at 2:04 PM Antoine Pitrou 
> wrote:
> >
> >>
> >> On 19/05/2020 at 13:43, Ryan Murray wrote:
> >>> Hey All,
> >>>
> >>> While working on https://issues.apache.org/jira/browse/ARROW-1692 I
> >> noticed
> >>> that there is a difference between C++ and Java on the way Sparse
> Unions
> >>> are handled. I haven't seen in the format spec which is correct, so I
> >>> wanted to check with the wider community.
> >>>
> >>> c++ (and the integration tests) see sparse unions as:
> >>> name
> >>> count
> >>> VALIDITY[]
> >>> TYPE_ID[]
> >>> children[]
> >>>
> >>> and java as:
> >>> name
> >>> count
> >>> TYPE[]
> >>> children[]
> >>>
> >>> The precise names may only be important for json reading/writing in the
> >>> integration tests so I will ignore TYPE/TYPE_ID for now. However, the
> big
> >>> difference is that Java doesn't have a validity buffer and C++ does. My
> >>> understanding is that technically the validity buffer is redundant (0
> >> type
> >>> == NULL) so I can see why Java would omit it. My question is then:
> which
> >>> language is 'correct'?
> >>
> >> Union type ids are logical, so 0 could very well be a valid type id.
> >> You can't assume that type 0 means a null entry.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >
>


Re: Sparse Union format

2020-05-19 Thread Ryan Murray
Thanks Antoine,

Can you just clarify what you mean by 'type ids are logical'? In my mind
type ids are strongly coupled to the types and their order in Schema.fbs
[1]. Do you mean that the order there is only a convention and we can't
assume that 0 === Null?

Best,
Ryan

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L235

On Tue, May 19, 2020 at 2:04 PM Antoine Pitrou  wrote:

>
> On 19/05/2020 at 13:43, Ryan Murray wrote:
> > Hey All,
> >
> > While working on https://issues.apache.org/jira/browse/ARROW-1692 I
> noticed
> > that there is a difference between C++ and Java on the way Sparse Unions
> > are handled. I haven't seen in the format spec which is correct, so I
> > wanted to check with the wider community.
> >
> > c++ (and the integration tests) see sparse unions as:
> > name
> > count
> > VALIDITY[]
> > TYPE_ID[]
> > children[]
> >
> > and java as:
> > name
> > count
> > TYPE[]
> > children[]
> >
> > The precise names may only be important for json reading/writing in the
> > integration tests so I will ignore TYPE/TYPE_ID for now. However, the big
> > difference is that Java doesn't have a validity buffer and C++ does. My
> > understanding is that technically the validity buffer is redundant (0
> type
> > == NULL) so I can see why Java would omit it. My question is then: which
> > language is 'correct'?
>
> Union type ids are logical, so 0 could very well be a valid type id.
> You can't assume that type 0 means a null entry.
>
> Regards
>
> Antoine.
>
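To make the point concrete: in Java the union type carries an explicit mapping from logical type ids to children, so the ids need not start at zero or follow the Type enum order. A minimal sketch (field names and ids chosen arbitrarily for illustration):

import java.util.Arrays;
import org.apache.arrow.vector.types.UnionMode;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class UnionTypeIdSketch {
  public static void main(String[] args) {
    Field ints = Field.nullable("ints", new ArrowType.Int(32, true));
    Field strings = Field.nullable("strings", new ArrowType.Utf8());
    // typeIds maps logical ids to children: id 5 -> "ints", id 3 -> "strings".
    // The values stored in the type-id buffer are these labels, so 0 is not
    // reserved to mean "null"; it is simply another possible label.
    ArrowType.Union unionType = new ArrowType.Union(UnionMode.Sparse, new int[] {5, 3});
    Field union = new Field("u", FieldType.nullable(unionType), Arrays.asList(ints, strings));
    System.out.println(union);
  }
}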


Sparse Union format

2020-05-19 Thread Ryan Murray
Hey All,

While working on https://issues.apache.org/jira/browse/ARROW-1692 I noticed
that there is a difference between C++ and Java on the way Sparse Unions
are handled. I haven't seen in the format spec which is correct, so I
wanted to check with the wider community.

C++ (and the integration tests) see sparse unions as:
name
count
VALIDITY[]
TYPE_ID[]
children[]

and Java as:
name
count
TYPE[]
children[]

The precise names may only be important for json reading/writing in the
integration tests so I will ignore TYPE/TYPE_ID for now. However, the big
difference is that Java doesn't have a validity buffer and C++ does. My
understanding is that technically the validity buffer is redundant (0 type
== NULL) so I can see why Java would omit it. My question is then: which
language is 'correct'?

I suppose the actual language implementation is not entirely relevant here;
instead, 'correct' refers to what the canonical IPC schema for a sparse union
should be.

Best,
Ryan


[jira] [Created] (ARROW-8739) Standardise Logger naming

2020-05-08 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-8739:
--

 Summary: Standardise Logger naming
 Key: ARROW-8739
 URL: https://issues.apache.org/jira/browse/ARROW-8739
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Ryan Murray


As per: https://github.com/apache/arrow/pull/7100#discussion_r421884919

We use LOGGER and logger interchangeably and should choose one



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8738) Investigate adding a getUnsafe method to vectors

2020-05-08 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-8738:
--

 Summary: Investigate adding a getUnsafe method to vectors
 Key: ARROW-8738
 URL: https://issues.apache.org/jira/browse/ARROW-8738
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Ryan Murray


As per: https://github.com/apache/arrow/pull/7095#issuecomment-625579459



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8696) [Java] Convert tests to integration tests

2020-05-04 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-8696:
--

 Summary: [Java] Convert tests to integration tests
 Key: ARROW-8696
 URL: https://issues.apache.org/jira/browse/ARROW-8696
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray


Some tests under arrow-memory and arrow-vector are integration tests but run
via main(). We should convert them to proper integration tests under the Maven
Failsafe plugin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8695) [Java] remove references to NettyUtils in memory module

2020-05-04 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-8695:
--

 Summary: [Java] remove references to NettyUtils in memory module
 Key: ARROW-8695
 URL: https://issues.apache.org/jira/browse/ARROW-8695
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray


Part of breaking ARROW-8230 into smaller chunks. First remove NettyUtils
references from 'pure' arrow-memory classes before breaking the Netty classes out.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8687) [Java] Finish move of io.netty.buffer.ArrowBuf

2020-05-04 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-8687:
--

 Summary: [Java] Finish move of io.netty.buffer.ArrowBuf
 Key: ARROW-8687
 URL: https://issues.apache.org/jira/browse/ARROW-8687
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray


Strangely a few tests that reference io.netty.buffer.ArrowBuf weren't picked up 
in the automated tests. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8664) Add isSet skip check to all Vector types

2020-05-01 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-8664:
--

 Summary: Add isSet skip check to all Vector types
 Key: ARROW-8664
 URL: https://issues.apache.org/jira/browse/ARROW-8664
 Project: Apache Arrow
  Issue Type: Task
Reporter: Ryan Murray
Assignee: Ryan Murray


ARROW-5290 added a flag to disable null checks in Vector get methods. However,
the following vectors were missed:

* DurationVector
* FixedSizeBinaryVector
* TimeStampMicroTZVector
* TimeStampMicroVector
* TimeStampMilliTZVector
* TimeStampMilliVector
* TimeStampNanoTZVector
* TimeStampNanoVector
* TimeStampSecTZVector
* TimeStampSecVector
* VarCharVector
* VarBinaryVector
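For context, the ARROW-5290 flag is read once from a JVM system property (assumed here to be arrow.enable_null_check_for_get, per that change); a short sketch of what skipping the check implies for callers:

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class NullCheckSketch {
  public static void main(String[] args) throws Exception {
    // The flag must be set before any vector classes load, e.g.:
    //   java -Darrow.enable_null_check_for_get=false ...
    // With the check disabled, get() no longer verifies isSet(), so callers
    // are expected to test isNull()/isSet() themselves before reading a slot.
    try (RootAllocator allocator = new RootAllocator();
         IntVector vector = new IntVector("ints", allocator)) {
      vector.allocateNew(2);
      vector.set(0, 42);            // slot 1 is intentionally left null
      vector.setValueCount(2);
      if (!vector.isNull(1)) {
        System.out.println(vector.get(1));
      }
      System.out.println(vector.get(0));
    }
  }
}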



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Accept "DoExchange" RPC to Arrow Flight protocol

2020-03-28 Thread Ryan Murray
+1 non-binding



On Sat, Mar 28, 2020 at 1:44 AM Wes McKinney  wrote:

> Hello,
>
> David M Li has proposed adding a "bidirectional" DoExchange RPC [1] to
> the Arrow Flight Protocol [2]. In this client call, datasets (possibly
> having different schemas) are sent by both the
> client and server in a single transaction. This can be used to offload
> computational tasks and other workloads not currently well-supported
> by the Flight protocol.
>
> Please vote whether to accept the addition. The vote will be open for
> at least 72 hours (since it's Friday, it'll be open for a good deal
> longer than 72 hours).
>
> [ ] +1 Accept this addition to the Flight protocol
> [ ] +0
> [ ] -1 Do not accept the changes because...
>
> Here is my vote: +1
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/pull/6686
>


[jira] [Created] (ARROW-8183) [c++][FlightRPC] Expose transport error metadata

2020-03-22 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-8183:
--

 Summary: [c++][FlightRPC] Expose transport error metadata
 Key: ARROW-8183
 URL: https://issues.apache.org/jira/browse/ARROW-8183
 Project: Apache Arrow
  Issue Type: Task
  Components: FlightRPC, Java
Reporter: Ryan Murray
Assignee: Ryan Murray
 Fix For: 0.17.0


Flight currently does not make the gRPC trailer available when a
FlightRuntimeException is raised. This precludes using the richer error model
discussed in [1][2]. The metadata in the gRPC trailer should be exposed in a 
protocol agnostic way. 

This task is to expose error metadata without exposing transport implementation 
specifics.

[1] https://grpc.io/docs/guides/error/#richer-error-model
[2] https://cloud.google.com/apis/design/errors#error_model



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8181) [Java][FlightRPC] Expose transport error metadata

2020-03-21 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-8181:
--

 Summary: [Java][FlightRPC] Expose transport error metadata
 Key: ARROW-8181
 URL: https://issues.apache.org/jira/browse/ARROW-8181
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Ryan Murray
Assignee: Ryan Murray


Flight currently does not make the gRPC trailer available when a
FlightRuntimeException is raised. This precludes using the richer error model
discussed in [1][2]. The metadata in the gRPC trailer should be exposed in a 
protocol agnostic way. 

This task is to expose error metadata without exposing transport implementation 
specifics.

[1] https://grpc.io/docs/guides/error/#richer-error-model
[2] https://cloud.google.com/apis/design/errors#error_model



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Java] PR Reviewers

2020-01-27 Thread Ryan Murray
Hey all, I would love to help out. Are there any specific ones that are
relatively easy for me to get started on?

On Mon, 27 Jan 2020, 18:31 Bryan Cutler,  wrote:

> Hi Micah, I don't have a ton of bandwidth at the moment, but I'll try to
> review some more PRs. Anyone, please feel free to ping me too if you have a
> stale PR that needs some help getting through. Outreach to other Java
> communities sounds like a good idea - more Java users would definitely be a
> good thing!
>
> Bryan
>
> On Mon, Jan 27, 2020 at 8:12 AM Andy Grove  wrote:
>
> > I've now started working with the Java implementation of Arrow,
> > specifically Flight, and would be happy to help although I do have
> limited
> > time each week. I can at least review from a Java correctness point of
> > view.
> >
> > Andy.
> >
> > On Thu, Jan 23, 2020 at 9:41 PM Micah Kornfield 
> > wrote:
> >
> > > I mentioned this elsewhere but my intent is to stop doing java reviews
> > for
> > > the immediate future once I wrap up the few that I have requested
> change
> > > on.
> > >
> > > I'm happy to try to triage incoming Java PRs, but in order to do this,
> I
> > > need to know which committers have some bandwidth to do reviews (some
> of
> > > the existing PRs I've tagged people who never responded).
> > >
> > > Thanks,
> > > Micah
> > >
> >
>


Re: Horizontal scaling design suggestion: Apache arrow flight

2019-10-18 Thread Ryan Murray
Hey Vinay,

This Spark source might be of interest [1]. We had discussed the possibility
of moving it into Arrow proper as a contrib module once it is more stable.

This is doing something similar to what you are suggesting: talking to a
cluster of Flight servers from Spark. It deals more with the client side
than with the server side, however. It talks to a single Flight
'coordinator' and uses getSchema/getFlightInfo to tell the coordinator it
wants a particular dataset. The coordinator then returns a list of Flight
tickets covering portions of the required dataset. A client can (a) ask for
the entire dataset from the coordinator, (b) iterate serially through the
tickets and assemble the whole dataset on the client side, or (c) fetch the
tickets in parallel, as the Spark connector does (see the sketch after the
links below).

I think the server side as you describe above doesn't yet exist in a
standalone form, although the Spark connector was developed in conjunction
with [2] as the server. That is, however, highly dependent on the
implementation details of the Dremio engine, as it takes care of the
coordination between the Flight workers. The idea is identical to yours:
a coordinator engine, a distributed store for engine metadata, and worker
engines which create/serve the Arrow buffers.

Would be happy to discuss further if you are interested in working on this
stuff!

Best,
Ryan

[1] https://github.com/rymurr/flight-spark-source
[2] https://github.com/dremio-hub/dremio-flight-connector
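For reference, the client-side pattern described above (one getFlightInfo call against the coordinator, then one stream per returned ticket) looks roughly like the following Java sketch; the hostnames, dataset path, port, and the sequential loop are illustrative simplifications (the Spark connector fetches the tickets in parallel):

import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.FlightDescriptor;
import org.apache.arrow.flight.FlightEndpoint;
import org.apache.arrow.flight.FlightInfo;
import org.apache.arrow.flight.FlightStream;
import org.apache.arrow.flight.Location;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;

public class CoordinatorClientSketch {
  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator();
         FlightClient coordinator = FlightClient.builder(allocator,
             Location.forGrpcInsecure("coordinator-host", 47470)).build()) {
      // Ask the coordinator which workers hold pieces of the dataset.
      FlightInfo info = coordinator.getInfo(FlightDescriptor.path("my-dataset"));
      for (FlightEndpoint endpoint : info.getEndpoints()) {
        // Each endpoint names the worker(s) serving that portion of the data.
        Location worker = endpoint.getLocations().get(0);
        try (FlightClient workerClient = FlightClient.builder(allocator, worker).build();
             FlightStream stream = workerClient.getStream(endpoint.getTicket())) {
          while (stream.next()) {
            VectorSchemaRoot root = stream.getRoot();
            System.out.println("received a batch of " + root.getRowCount() + " rows");
          }
        }
      }
    }
  }
}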

On Fri, Oct 18, 2019 at 3:05 PM Vinay Kesarwani 
wrote:

> Hi,
>
> I am trying to establish following architecture
>
> My approach for flight horizontal scaling is to launch
> 1-Apache flight server in each node
> 2-one node declared as coordinator
> 3-Publish coordinator info to a shared service [zookeeper]
> 4-Launch worker node --> get coordinator node info from [zookeeper]
> 5-Worker publishes its info to [zookeeper] to consumed by others
>
> Client connects to coordinator:
> 1- Calls getFlightInfo(desc)
> 2-Here Co-coordinator node overrides getFlightInfo()
> 3-getFlightInfo() method internally get worker info based on the descriptor
> from zookeeper
> 4-Client consumes data from each endpoint in iterative manner OR in
> parallel[not sure how]
> -->getData()
>
> PutData()
> 5-Client calls putdata() to put data in different nodes in flight stream
> 6-Iterate through the endpoints and matches worker node IP
> 7-if Worker IP matches with endpoint; worker put data in that node flight
> server.
> 8-On putting any new stream/updated; worker node info is updated in
> zookeeper
> 9-In case worker IP doesn't match with the endpoint we need to put data in
> any other worker node; and publish the info in zookeeper.
>
> [in future distributed-client and distributed end point] example: spark
> workers to Apache arrow flight cluster
>
> [image: image]
> <https://user-images.githubusercontent.com/6141965/67092386-b0012c00-f1cc-11e9-9ce2-d657001a85f7.png>
>
> Just wanted to discuss if any PR is in progress for horizontal scaling in
> Arrow flight, or any design doc is under discussion.
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>


Re: [Discuss][FlightRPC] Extensions to Flight: "DoBidirectional"

2019-10-15 Thread Ryan Murray
Cool, makes a ton of sense now. Thanks!

On Tue, Oct 15, 2019 at 3:11 PM David Li  wrote:

> Hey Ryan,
>
> Thanks for the comments.
>
> Concrete example: I've edited the doc to provide a Python strawman.
>
> Sync vs async: while I don't touch on it, you could interleave uploads
> and downloads if you were so inclined. Right now, synchronous APIs
> make this error-prone, e.g. if both client and server wait for each
> other due to an application logic bug. (gRPC doesn't give us the
> ability to have per-read timeouts, only an overall timeout.) As an
> example of this happening with DoPut, see ARROW-6063:
> https://issues.apache.org/jira/browse/ARROW-6063
>
> This is mostly tangential though, eventually we will want to design
> asynchronous APIs for Flight as a whole. A bidirectional stream like
> this (and like DoPut) just makes these pitfalls easier to run into.
>
> Using DoPut+DoGet: I discussed this in the proposal, but the main
> concern is that depending on how you deploy, two separate calls could
> get routed to different instances. Additionally, gRPC has some
> reconnection behaviors; if the server goes away in between the two
> calls, but it then restarts or there is another instance available,
> the client will happily reconnect to the new server without warning.
>
> Thanks,
> David
>
> On 10/15/19, Ryan Murray  wrote:
> > Hey David,
> >
> > I think this proposal makes a lot of sense. I like it and the possibility
> > of remote compute via arrow buffers. One thing that would help me would
> be
> > a concrete example of the API in a real-life use case. Also, what would the
> > client experience be in terms of sync vs async? Would the client block until
> > the bidirectional call returns, i.e. c = flight.vector_mult(a, b), or would
> > the client wait to be signaled that the computation was done? If the latter,
> > how is that different from a DoPut then a DoGet? I suppose that this could be
> > implemented without extending the RPC interface but rather by a
> > function/util?
> >
> >
> > Best,
> >
> > Ryan
> >
> > On Sun, Oct 13, 2019 at 9:24 PM David Li  wrote:
> >
> >> Hi all,
> >>
> >> We've been using Flight quite successfully so far, but we have
> >> identified a new use case on the horizon: being able to both send and
> >> retrieve Arrow data within a single RPC call. To that end, I've
> >> written up a proposal for a new RPC method:
> >>
> >>
> https://docs.google.com/document/d/1Hh-3Z0hK5PxyEYFxwVxp77jens3yAgC_cpp0TGW-dcw/edit?usp=sharing
> >>
> >> Please let me know if you can't view or comment on the document. I'd
> >> appreciate any feedback; I think this is a relatively straightforward
> >> addition - it is essentially "DoPutThenGet".
> >>
> >> This is a format change and would require a vote. I've decided to
> >> table the other format change I had proposed (on DoPut), as it doesn't
> >> functionally change Flight, just the interpretation of the semantics.
> >>
> >> Thanks,
> >> David
> >>
> >
> >
> > --
> >
> > Ryan Murray  | Principal Consulting Engineer
> >
> > +447540852009 | rym...@dremio.com
> >
> > <https://www.dremio.com/>
> > Check out our GitHub <https://www.github.com/dremio>, join our community
> > site <https://community.dremio.com/> & Download Dremio
> > <https://www.dremio.com/download>
> >
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>


Re: [Discuss][FlightRPC] Extensions to Flight: "DoBidirectional"

2019-10-15 Thread Ryan Murray
Hey David,

I think this proposal makes a lot of sense. I like it and the possibility
of remote compute via arrow buffers. One thing that would help me would be
a concrete example of the API in a real life use case. Also, what would the
client experience be in terms of sync vs asyc? Would the client block till
the bidirectional call return ie c = flight.vector_mult(a, b) or would the
client wait to be signaled that computation was done. If the later how is
that different from a DoPut then DoGet? I suppose that this could be
implemented without extending the RPC interface but rather by a
function/util?


Best,

Ryan

On Sun, Oct 13, 2019 at 9:24 PM David Li  wrote:

> Hi all,
>
> We've been using Flight quite successfully so far, but we have
> identified a new use case on the horizon: being able to both send and
> retrieve Arrow data within a single RPC call. To that end, I've
> written up a proposal for a new RPC method:
>
> https://docs.google.com/document/d/1Hh-3Z0hK5PxyEYFxwVxp77jens3yAgC_cpp0TGW-dcw/edit?usp=sharing
>
> Please let me know if you can't view or comment on the document. I'd
> appreciate any feedback; I think this is a relatively straightforward
> addition - it is essentially "DoPutThenGet".
>
> This is a format change and would require a vote. I've decided to
> table the other format change I had proposed (on DoPut), as it doesn't
> functionally change Flight, just the interpretation of the semantics.
>
> Thanks,
> David
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>


Re: [ANNOUNCE] New Arrow committer: David M Li

2019-08-31 Thread Ryan Murray
Congratulations David!

On Sat, 31 Aug 2019, 03:56 Micah Kornfield,  wrote:

> Congrats David, well deserved.
>
> On Fri, Aug 30, 2019 at 2:02 PM Bryan Cutler  wrote:
>
> > Congrats David!
> >
> > On Fri, Aug 30, 2019 at 10:19 AM Antoine Pitrou 
> > wrote:
> >
> > >
> > > Congratulations David and welcome to the team  :-)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > > On 30/08/2019 at 18:21, Wes McKinney wrote:
> > > > On behalf of the Arrow PMC I'm happy to announce that David has
> > > > accepted an invitation to become an Arrow committer!
> > > >
> > > > Welcome, and thank you for your contributions!
> > > >
> > >
> >
>


Re: [VOTE] Proposed addition to Arrow Flight Protocol

2019-08-20 Thread Ryan Murray
Thanks Micah/All!
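
For anyone finding this thread later, here is a rough Java sketch of using the new RPC as implemented in the merged patch [1]; the dataset path, host, and port are assumptions for illustration, and exact signatures should be checked against the PR:

import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.FlightDescriptor;
import org.apache.arrow.flight.Location;
import org.apache.arrow.flight.SchemaResult;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.types.pojo.Schema;

public class GetSchemaSketch {
  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator();
         FlightClient client = FlightClient.builder(allocator,
             Location.forGrpcInsecure("localhost", 47470)).build()) {
      // Retrieve only the schema, without fetching endpoints via GetFlightInfo.
      SchemaResult result = client.getSchema(FlightDescriptor.path("my-dataset"));
      Schema schema = result.getSchema();
      System.out.println(schema);
    }
  }
}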

On Tue, Aug 20, 2019 at 6:06 AM Micah Kornfield 
wrote:

> The motion carries with 4 binding +1 votes, 2 non-binding +1 votes, and no
> other votes.
>
> I think the next step is to review and merge the pending patch [1].
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/4980
>
>
>
> On Mon, Aug 19, 2019 at 2:52 AM Antoine Pitrou  wrote:
>
> >
> > +1 (binding)
> >
> > Regards
> >
> > Antoine.
> >
> >
> > > On 16/08/2019 at 07:44, Micah Kornfield wrote:
> > > Hello,
> > > Ryan Murray has proposed adding a GetFlightSchema RPC [1] to the Arrow
> > > Flight Protocol [2].  The purpose of this RPC is to allow decoupling
> > schema
> > > and endpoint retrieval as provided by the GetFlightInfo RPC.  The new
> > > definition provided is:
> > >
> > > message SchemaResult {
> > >   // Serialized Flatbuffer Schema message.
> > >   bytes schema = 1;
> > > }
> > > rpc GetSchema(FlightDescriptor) returns (SchemaResult) {}
> > >
> > > Ryan has also provided a PR demonstrating implementation of the new RPC
> > [3]
> > > in Java, C++ and Python which can be reviewed and merged after this
> > > addition is approved.
> > >
> > > Please vote whether to accept the addition. The vote will be open for
> at
> > > least 72 hours.
> > >
> > > [ ] +1 Accept this addition to the Flight protocol
> > > [ ] +0
> > > [ ] -1 Do not accept the changes because...
> > >
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > >
> >
> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit
> > > [2] https://github.com/apache/arrow/blob/master/format/Flight.proto
> > > [3] https://github.com/apache/arrow/pull/4980
> > >
> >
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>


Re: [DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-16 Thread Ryan Murray
Thanks both :-)

On Fri, Aug 16, 2019 at 6:22 AM Micah Kornfield 
wrote:

> I'll start one shortly.
>
> On Thu, Aug 15, 2019 at 4:31 PM Wes McKinney  wrote:
>
> > Yes, I think having a vote as a procedural matter would be a good thing.
> >
> > I have run dozens of public and private votes in my role as a PMC
> > member. I would appreciate if another PMC would assist with this vote.
> >
> > Thanks
> >
> > On Wed, Aug 14, 2019 at 5:37 PM Ryan Murray  wrote:
> > >
> > > Hi All,
> > >
> > > Does this require a vote? If yes, what is the process for initiating one?
> > > If no, I hope this has been enough time for feedback, and I would like to
> > > remove the draft designation from the PR.
> > >
> > > Best,
> > > Ryan
> > >
> > > On Wed, Aug 7, 2019 at 9:31 AM Ryan Murray  wrote:
> > >
> > > > As per everyone's feedback I have renamed GetFlightSchema ->
> GetSchema
> > and
> > > > have removed the descriptor on the rpc result message. The doc has
> been
> > > > updated as has the draft PR
> > > >
> > > > On Thu, Aug 1, 2019 at 6:32 PM Bryan Cutler 
> wrote:
> > > >
> > > >> Sounds good to me, I would just echo what others have said.
> > > >>
> > > >> On Thu, Aug 1, 2019 at 8:17 AM Ryan Murray 
> wrote:
> > > >>
> > > >> > Thanks Wes,
> > > >> >
> > > >> > The descriptor is only there to maintain a bit of symmetry with
> > > >> > GetFlightInfo. Happy to remove it, I don't think its necessary and
> > > >> already
> > > >> > a few people agree. Similar with the method name, I am neutral to
> > the
> > > >> > naming and can call it whatever the community is happy with.
> > > >> >
> > > >> > Best,
> > > >> > Ryan
> > > >> >
> > > >> > On Thu, Aug 1, 2019 at 3:56 PM Wes McKinney 
> > > >> wrote:
> > > >> >
> > > >> > > I'm generally supportive of adding the new RPC endpoint.
> > > >> > >
> > > >> > > To make a couple points from the document
> > > >> > >
> > > >> > > * I'm not sure what the purpose of returning the
> FlightDescriptor
> > is,
> > > >> > > but I haven't thought too hard about it
> > > >> > > * The Schema consists of a single IPC message -- dictionaries
> will
> > > >> > > appear in the actual DoGet stream. To motivate why this is --
> > > >> > > different endpoints might have different dictionaries
> > corresponding to
> > > >> > > fields in the schema, to have static/constant dictionaries in a
> > > >> > > distributed Flight setting is likely to be impractical. I
> > summarize
> > > >> > > the issue as "dictionaries are data, not metadata".
> > > >> > > * I would be OK calling this GetSchema instead of
> GetFlightSchema
> > but
> > > >> > > either is okay
> > > >> > >
> > > >> > > - Wes
> > > >> > >
> > > >> > > On Thu, Aug 1, 2019 at 8:08 AM David Li 
> > > >> wrote:
> > > >> > > >
> > > >> > > > Hi Ryan,
> > > >> > > >
> > > >> > > > Thanks for writing this up! I made a couple of minor comments
> > in the
> > > >> > > > doc/implementation, but overall I'm in favor of having this
> RPC
> > > >> > > > method.
> > > >> > > >
> > > >> > > > Best,
> > > >> > > > David
> > > >> > > >
> > > >> > > > On 8/1/19, Ryan Murray  wrote:
> > > >> > > > > Hi All,
> > > >> > > > >
> > > >> > > > > Please see the attached document for a proposed addition to
> > the
> > > >> > Flight
> > > >> > > > > RPC[1]. This is the result of a previous mailing list
> > > >> discussion[2].
> > > >> > > > >
> > > >> > > > > I have created the Pull Request[3] to make the proposal a
> > little
> > > >> more
> > > >> > > > 

Re: [DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-14 Thread Ryan Murray
Hi All,

Does this require a vote? If yes, what is the process for initiating one?
If no, I hope this has been enough time for feedback, and I would like to
remove the draft designation from the PR.

Best,
Ryan

On Wed, Aug 7, 2019 at 9:31 AM Ryan Murray  wrote:

> As per everyone's feedback I have renamed GetFlightSchema -> GetSchema and
> have removed the descriptor on the rpc result message. The doc has been
> updated as has the draft PR
>
> On Thu, Aug 1, 2019 at 6:32 PM Bryan Cutler  wrote:
>
>> Sounds good to me, I would just echo what others have said.
>>
>> On Thu, Aug 1, 2019 at 8:17 AM Ryan Murray  wrote:
>>
>> > Thanks Wes,
>> >
>> > The descriptor is only there to maintain a bit of symmetry with
>> > GetFlightInfo. Happy to remove it, I don't think its necessary and
>> already
>> > a few people agree. Similar with the method name, I am neutral to the
>> > naming and can call it whatever the community is happy with.
>> >
>> > Best,
>> > Ryan
>> >
>> > On Thu, Aug 1, 2019 at 3:56 PM Wes McKinney 
>> wrote:
>> >
>> > > I'm generally supportive of adding the new RPC endpoint.
>> > >
>> > > To make a couple points from the document
>> > >
>> > > * I'm not sure what the purpose of returning the FlightDescriptor is,
>> > > but I haven't thought too hard about it
>> > > * The Schema consists of a single IPC message -- dictionaries will
>> > > appear in the actual DoGet stream. To motivate why this is --
>> > > different endpoints might have different dictionaries corresponding to
>> > > fields in the schema, to have static/constant dictionaries in a
>> > > distributed Flight setting is likely to be impractical. I summarize
>> > > the issue as "dictionaries are data, not metadata".
>> > > * I would be OK calling this GetSchema instead of GetFlightSchema but
>> > > either is okay
>> > >
>> > > - Wes
>> > >
>> > > On Thu, Aug 1, 2019 at 8:08 AM David Li 
>> wrote:
>> > > >
>> > > > Hi Ryan,
>> > > >
>> > > > Thanks for writing this up! I made a couple of minor comments in the
>> > > > doc/implementation, but overall I'm in favor of having this RPC
>> > > > method.
>> > > >
>> > > > Best,
>> > > > David
>> > > >
>> > > > On 8/1/19, Ryan Murray  wrote:
>> > > > > Hi All,
>> > > > >
>> > > > > Please see the attached document for a proposed addition to the
>> > Flight
>> > > > > RPC[1]. This is the result of a previous mailing list
>> discussion[2].
>> > > > >
>> > > > > I have created the Pull Request[3] to make the proposal a little
>> more
>> > > > > concrete.
>> > > > > <https://www.dremio.com/download>
>> > > > > Please let me know if you have any questions or concerns.
>> > > > >
>> > > > > Best,
>> > > > > Ryan
>> > > > >
>> > > > > [1]:
>> > > > >
>> > >
>> >
>> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
>> > > > > [2]:
>> > > > >
>> > >
>> >
>> https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E
>> > > > > [3]: https://github.com/apache/arrow/pull/4980
>> > > > >
>> > >
>> >
>> >
>> > --
>> >
>> > Ryan Murray  | Principal Consulting Engineer
>> >
>> > +447540852009 | rym...@dremio.com
>> >
>> > <https://www.dremio.com/>
>> > Check out our GitHub <https://www.github.com/dremio>, join our
>> community
>> > site <https://community.dremio.com/> & Download Dremio
>> > <https://www.dremio.com/download>
>> >
>>
>
>
> --
>
> Ryan Murray  | Principal Consulting Engineer
>
> +447540852009 | rym...@dremio.com
>
> <https://www.dremio.com/>
> Check out our GitHub <https://www.github.com/dremio>, join our community
> site <https://community.dremio.com/> & Download Dremio
> <https://www.dremio.com/download>
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>


Re: [DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-07 Thread Ryan Murray
As per everyone's feedback, I have renamed GetFlightSchema -> GetSchema and
removed the descriptor from the RPC result message. The doc has been updated,
as has the draft PR.
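
For anyone who hasn't looked at the draft PR, this is roughly what the call
ends up looking like from the Java client. Treat it as a sketch only: the
host, port and descriptor path below are placeholders, and the exact API may
still shift in review.

    // Sketch: fetch only the schema for a descriptor; no FlightInfo/endpoints.
    import org.apache.arrow.flight.FlightClient;
    import org.apache.arrow.flight.FlightDescriptor;
    import org.apache.arrow.flight.Location;
    import org.apache.arrow.flight.SchemaResult;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.types.pojo.Schema;

    public class GetSchemaSketch {
      public static void main(String[] args) throws Exception {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FlightClient client = FlightClient.builder(
                 allocator, Location.forGrpcInsecure("localhost", 12233)).build()) {
          // Schema only; no endpoints or tickets are prepared on the server.
          SchemaResult result = client.getSchema(FlightDescriptor.path("my", "dataset"));
          Schema schema = result.getSchema();
          System.out.println(schema);
        }
      }
    }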

On Thu, Aug 1, 2019 at 6:32 PM Bryan Cutler  wrote:

> Sounds good to me, I would just echo what others have said.
>
> On Thu, Aug 1, 2019 at 8:17 AM Ryan Murray  wrote:
>
> > Thanks Wes,
> >
> > The descriptor is only there to maintain a bit of symmetry with
> > GetFlightInfo. Happy to remove it, I don't think its necessary and
> already
> > a few people agree. Similar with the method name, I am neutral to the
> > naming and can call it whatever the community is happy with.
> >
> > Best,
> > Ryan
> >
> > On Thu, Aug 1, 2019 at 3:56 PM Wes McKinney  wrote:
> >
> > > I'm generally supporting of adding the new RPC endpoint.
> > >
> > > To make a couple points from the document
> > >
> > > * I'm not sure what the purpose of returning the FlightDescriptor is,
> > > but I haven't thought too hard about it
> > > * The Schema consists of a single IPC message -- dictionaries will
> > > appear in the actual DoGet stream. To motivate why this is --
> > > different endpoints might have different dictionaries corresponding to
> > > fields in the schema, to have static/constant dictionaries in a
> > > distributed Flight setting is likely to be impractical. I summarize
> > > the issue as "dictionaries are data, not metadata".
> > > * I would be OK calling this GetSchema instead of GetFlightSchema but
> > > either is okay
> > >
> > > - Wes
> > >
> > > On Thu, Aug 1, 2019 at 8:08 AM David Li  wrote:
> > > >
> > > > Hi Ryan,
> > > >
> > > > Thanks for writing this up! I made a couple of minor comments in the
> > > > doc/implementation, but overall I'm in favor of having this RPC
> > > > method.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On 8/1/19, Ryan Murray  wrote:
> > > > > Hi All,
> > > > >
> > > > > Please see the attached document for a proposed addition to the
> > Flight
> > > > > RPC[1]. This is the result of a previous mailing list
> discussion[2].
> > > > >
> > > > > I have created the Pull Request[3] to make the proposal a little
> more
> > > > > concrete.
> > > > > Please let me know if you have any questions or concerns.
> > > > >
> > > > > Best,
> > > > > Ryan
> > > > >
> > > > > [1]:
> > > > >
> > >
> >
> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
> > > > > [2]:
> > > > >
> > >
> >
> https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E
> > > > > [3]: https://github.com/apache/arrow/pull/4980
> > > > >
> > >
> >
> >
> > --
> >
> > Ryan Murray  | Principal Consulting Engineer
> >
> > +447540852009 | rym...@dremio.com
> >
> > <https://www.dremio.com/>
> > Check out our GitHub <https://www.github.com/dremio>, join our community
> > site <https://community.dremio.com/> & Download Dremio
> > <https://www.dremio.com/download>
> >
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>


Re: [DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-01 Thread Ryan Murray
Thanks Wes,

The descriptor is only there to maintain a bit of symmetry with
GetFlightInfo. Happy to remove it; I don't think it's necessary, and a few
people already agree. Similarly with the method name: I am neutral on the
naming and can call it whatever the community is happy with.

Best,
Ryan

On Thu, Aug 1, 2019 at 3:56 PM Wes McKinney  wrote:

> I'm generally supportive of adding the new RPC endpoint.
>
> To make a couple points from the document
>
> * I'm not sure what the purpose of returning the FlightDescriptor is,
> but I haven't thought too hard about it
> * The Schema consists of a single IPC message -- dictionaries will
> appear in the actual DoGet stream. To motivate why this is --
> different endpoints might have different dictionaries corresponding to
> fields in the schema, to have static/constant dictionaries in a
> distributed Flight setting is likely to be impractical. I summarize
> the issue as "dictionaries are data, not metadata".
> * I would be OK calling this GetSchema instead of GetFlightSchema but
> either is okay
>
> - Wes
>
> On Thu, Aug 1, 2019 at 8:08 AM David Li  wrote:
> >
> > Hi Ryan,
> >
> > Thanks for writing this up! I made a couple of minor comments in the
> > doc/implementation, but overall I'm in favor of having this RPC
> > method.
> >
> > Best,
> > David
> >
> > On 8/1/19, Ryan Murray  wrote:
> > > Hi All,
> > >
> > > Please see the attached document for a proposed addition to the Flight
> > > RPC[1]. This is the result of a previous mailing list discussion[2].
> > >
> > > I have created the Pull Request[3] to make the proposal a little more
> > > concrete.
> > > Please let me know if you have any questions or concerns.
> > >
> > > Best,
> > > Ryan
> > >
> > > [1]:
> > >
> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
> > > [2]:
> > >
> https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E
> > > [3]: https://github.com/apache/arrow/pull/4980
> > >
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>


[DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-01 Thread Ryan Murray
Hi All,

Please see the attached document for a proposed addition to the Flight
RPC[1]. This is the result of a previous mailing list discussion[2].

I have created the Pull Request[3] to make the proposal a little more
concrete.
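
To give a flavour of the server side, I've sketched below how a Java producer
could answer the schema-only request without preparing any endpoints. This is
illustrative only (hard-coded schema, and the method and message names are
placeholders that may not match the final proposal); it is not the code in
the PR.

    // Illustrative only: a producer that can return a schema cheaply, before
    // any endpoints or tickets exist. The schema here is hard-coded; a real
    // producer would derive it from the descriptor (e.g. by planning a query).
    import java.util.Collections;
    import org.apache.arrow.flight.FlightDescriptor;
    import org.apache.arrow.flight.NoOpFlightProducer;
    import org.apache.arrow.flight.SchemaResult;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.Schema;

    public class SchemaOnlyProducer extends NoOpFlightProducer {
      private static final Schema EXAMPLE_SCHEMA = new Schema(Collections.singletonList(
          Field.nullable("value", new ArrowType.Int(64, true))));

      @Override
      public SchemaResult getSchema(CallContext context, FlightDescriptor descriptor) {
        // Nothing about endpoints needs to be known to answer this.
        return new SchemaResult(EXAMPLE_SCHEMA);
      }
    }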

Please let me know if you have any questions or concerns.

Best,
Ryan

[1]:
https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
[2]:
https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E
[3]: https://github.com/apache/arrow/pull/4980


[jira] [Created] (ARROW-6094) Add GetFlightSchema to Flight RPC

2019-08-01 Thread Ryan Murray (JIRA)
Ryan Murray created ARROW-6094:
--

 Summary: Add GetFlightSchema to Flight RPC
 Key: ARROW-6094
 URL: https://issues.apache.org/jira/browse/ARROW-6094
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, FlightRPC, Java, Python
Reporter: Ryan Murray
Assignee: Ryan Murray
 Fix For: 0.15.0


Implement GetFlightSchema as per 
https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing

and 
https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: Spark and Arrow Flight

2019-07-25 Thread Ryan Murray
Hey David,

Yes, I am. I have a patch that is about 3/4 done and ready to go; I just got
busy with a few other things. Are you hoping to use it soon? I would like to
get to it this week, but it's looking increasingly unlikely.

Best,
Ryan

On Thu, Jul 25, 2019 at 7:37 PM David Li  wrote:

> Hey Ryan,
>
> To follow up on this, are you planning on formally proposing the
> GetSchema() call in Flight? I think it'd be interesting to have beyond
> the Spark usecase as finding the schema may or may not be expensive
> depending on the data stream (i.e. something computed on demand might
> require data to be computed in order to get the schema), and
> separating it from GetFlightInfo means that services that "don't know"
> the schema ahead of time can still respond to that endpoint quickly.
> (We could make the change minimal by leaving the schema in FlightInfo
> and simply specifying it as best-effort.)
>
> Best,
> David
>
> On 7/10/19, Ryan Murray  wrote:
> > Hey Wes,
> >
> > Would be happy to! Jacques and I had originally thought to try and get it
> > into Spark but perhaps Arrow might be a better home. I think the only
> issue
> > is whether we want to bring Spark jars and their dependencies into Arrow.
> > One challenge I have had so far with the connector is managing the
> > transitive arrow dependencies from Spark, the connector only works on
> > relatively recent versions of Spark and potentially can create circular
> > arrow dependencies. I think this issue will be better once 1.0.0 is done
> > and we can rely on a stable format/api.
> >
> > Best,
> > Ryan
> >
> > On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney  wrote:
> >
> >> Hi Ryan, have you thought about developing this inside Apache Arrow?
> >>
> >> On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler  wrote:
> >>
> >> > Great, thanks Ryan! I'll take a look
> >> >
> >> > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray  wrote:
> >> >
> >> > > Hi Bryan,
> >> > >
> >> > > I have an implementation of option #3 nearly ready for a PR. I will
> >> > mention
> >> > > you when I publish it.
> >> > >
> >> > > The working prototype for the Spark connector is here:
> >> > > https://github.com/rymurr/flight-spark-source. It technically works
> >> (and
> >> > > is
> >> > > very fast!) however the implementation is pretty dodgy and needs to
> >> > > be
> >> > > cleaned up before ready for prime time. I plan to have it ready to
> go
> >> for
> >> > > the Arrow 1.0.0 release as an apache 2.0 licensed project. Please
> >> > > shout
> >> > if
> >> > > you have any comments or are interested in contributing!
> >> > >
> >> > > Best,
> >> > > Ryan
> >> > >
> >> > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler 
> >> > > wrote:
> >> > >
> >> > > > I'm in favor of option #3 also, but not sure what the best thing
> to
> >> do
> >> > > with
> >> > > > the existing FlightInfo response is. I'm definitely interested in
> >> > > > connecting Spark with Flight, can you share more details of your
> >> > > > work
> >> > or
> >> > > is
> >> > > > it planned to be open sourced?
> >> > > >
> >> > > > Thanks,
> >> > > > Bryan
> >> > > >
> >> > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou  >
> >> > > wrote:
> >> > > >
> >> > > > >
> >> > > > > Either #3 or #4 for me.  If #3, the default GetSchema
> >> implementation
> >> > > can
> >> > > > > rely on calling GetFlightInfo.
> >> > > > >
> >> > > > >
> >> > > > > Le 01/07/2019 à 22:50, David Li a écrit :
> >> > > > > > I think I'd prefer #3 over overloading an existing call (#2).
> >> > > > > >
> >> > > > > > We've been thinking about a similar issue, where sometimes we
> >> want
> >> > > > > > just the schema, but the service can't necessarily return the
> >> > schema
> >> > > > > > without fetching data - right now we return a sentinel value
> in
> >> > > > > > GetFlightInfo, but a separate RPC would let us ex

Re: Spark and Arrow Flight

2019-07-10 Thread Ryan Murray
Hey Wes,

Would be happy to! Jacques and I had originally thought to try to get it
into Spark, but perhaps Arrow might be a better home. I think the only issue
is whether we want to bring Spark jars and their dependencies into Arrow.
One challenge I have had so far with the connector is managing the
transitive Arrow dependencies coming from Spark: the connector only works on
relatively recent versions of Spark and can potentially create circular
Arrow dependencies. I think this issue will improve once 1.0.0 is done and
we can rely on a stable format/API.

Best,
Ryan

On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney  wrote:

> Hi Ryan, have you thought about developing this inside Apache Arrow?
>
> On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler  wrote:
>
> > Great, thanks Ryan! I'll take a look
> >
> > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray  wrote:
> >
> > > Hi Bryan,
> > >
> > > I have an implementation of option #3 nearly ready for a PR. I will
> > mention
> > > you when I publish it.
> > >
> > > The working prototype for the Spark connector is here:
> > > https://github.com/rymurr/flight-spark-source. It technically works
> (and
> > > is
> > > very fast!) however the implementation is pretty dodgy and needs to be
> > > cleaned up before ready for prime time. I plan to have it ready to go
> for
> > > the Arrow 1.0.0 release as an apache 2.0 licensed project. Please shout
> > if
> > > you have any comments or are interested in contributing!
> > >
> > > Best,
> > > Ryan
> > >
> > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler  wrote:
> > >
> > > > I'm in favor of option #3 also, but not sure what the best thing to
> do
> > > with
> > > > the existing FlightInfo response is. I'm definitely interested in
> > > > connecting Spark with Flight, can you share more details of your work
> > or
> > > is
> > > > it planned to be open sourced?
> > > >
> > > > Thanks,
> > > > Bryan
> > > >
> > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou 
> > > wrote:
> > > >
> > > > >
> > > > > Either #3 or #4 for me.  If #3, the default GetSchema
> implementation
> > > can
> > > > > rely on calling GetFlightInfo.
> > > > >
> > > > >
> > > > > Le 01/07/2019 à 22:50, David Li a écrit :
> > > > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > > > >
> > > > > > We've been thinking about a similar issue, where sometimes we
> want
> > > > > > just the schema, but the service can't necessarily return the
> > schema
> > > > > > without fetching data - right now we return a sentinel value in
> > > > > > GetFlightInfo, but a separate RPC would let us explicitly
> indicate
> > an
> > > > > > error.
> > > > > >
> > > > > > I might be missing something though - what happens between step 1
> > and
> > > > > > 2 that makes the endpoints available? Would it make sense to use
> > > > > > DoAction to cause the backend to "prepare" the endpoints, and
> have
> > > the
> > > > > > result of that be an encoded schema? So then the flow would be
> > > > > > DoAction -> GetFlightInfo -> DoGet.
> > > > > >
> > > > > > Best,
> > > > > > David
> > > > > >
> > > > > > On 7/1/19, Wes McKinney  wrote:
> > > > > >> My inclination is either #2 or #3. #4 is an option of course,
> but
> > I
> > > > > >> like the more structured solution of explicitly requesting the
> > > schema
> > > > > >> given a descriptor.
> > > > > >>
> > > > > >> In both cases, it's possible that schemas are sent twice, e.g.
> if
> > > you
> > > > > >> call GetSchema and then later call GetFlightInfo and so you
> > receive
> > > > > >> the schema again. The schema is optional, so if it became a
> > > > > >> performance problem then a particular server might return the
> > schema
> > > > > >> as null from GetFlightInfo.
> > > > > >>
> > > > > >> I think it's valid to want to make a single GetFlightInfo RPC
> > > request
> > > > > >> that returns _bo

Re: Spark and Arrow Flight

2019-07-09 Thread Ryan Murray
Hi Bryan,

I have an implementation of option #3 nearly ready for a PR. I will mention
you when I publish it.

The working prototype for the Spark connector is here:
https://github.com/rymurr/flight-spark-source. It technically works (and is
very fast!); however, the implementation is pretty dodgy and needs to be
cleaned up before it's ready for prime time. I plan to have it ready to go
for the Arrow 1.0.0 release as an Apache 2.0-licensed project. Please shout
if you have any comments or are interested in contributing!

Best,
Ryan

On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler  wrote:

> I'm in favor of option #3 also, but not sure what the best thing to do with
> the existing FlightInfo response is. I'm definitely interested in
> connecting Spark with Flight, can you share more details of your work or is
> it planned to be open sourced?
>
> Thanks,
> Bryan
>
> On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou  wrote:
>
> >
> > Either #3 or #4 for me.  If #3, the default GetSchema implementation can
> > rely on calling GetFlightInfo.
> >
> >
> > Le 01/07/2019 à 22:50, David Li a écrit :
> > > I think I'd prefer #3 over overloading an existing call (#2).
> > >
> > > We've been thinking about a similar issue, where sometimes we want
> > > just the schema, but the service can't necessarily return the schema
> > > without fetching data - right now we return a sentinel value in
> > > GetFlightInfo, but a separate RPC would let us explicitly indicate an
> > > error.
> > >
> > > I might be missing something though - what happens between step 1 and
> > > 2 that makes the endpoints available? Would it make sense to use
> > > DoAction to cause the backend to "prepare" the endpoints, and have the
> > > result of that be an encoded schema? So then the flow would be
> > > DoAction -> GetFlightInfo -> DoGet.
> > >
> > > Best,
> > > David
> > >
> > > On 7/1/19, Wes McKinney  wrote:
> > >> My inclination is either #2 or #3. #4 is an option of course, but I
> > >> like the more structured solution of explicitly requesting the schema
> > >> given a descriptor.
> > >>
> > >> In both cases, it's possible that schemas are sent twice, e.g. if you
> > >> call GetSchema and then later call GetFlightInfo and so you receive
> > >> the schema again. The schema is optional, so if it became a
> > >> performance problem then a particular server might return the schema
> > >> as null from GetFlightInfo.
> > >>
> > >> I think it's valid to want to make a single GetFlightInfo RPC request
> > >> that returns _both_ the schema and the query plan.
> > >>
> > >> Thoughts from others?
> > >>
> > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau 
> > wrote:
> > >>>
> > >>> My initial inclination is towards #3 but I'd be curious what others
> > >>> think.
> > >>> In the case of #3, I wonder if it makes sense to then pull the Schema
> > off
> > >>> the GetFlightInfo response...
> > >>>
> > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray 
> > wrote:
> > >>>
> > >>>> Hi All,
> > >>>>
> > >>>> I have been working on building an arrow flight source for spark.
> The
> > >>>> goal
> > >>>> here is for Spark to be able to use a group of arrow flight
> endpoints
> > >>>> to
> > >>>> get a dataset pulled over to spark in parallel.
> > >>>>
> > >>>> I am unsure of the best model for the spark <-> flight conversation
> > and
> > >>>> wanted to get your opinion on the best way to go.
> > >>>>
> > >>>> I am breaking up the query to flight from spark into 3 parts:
> > >>>> 1) get the schema using GetFlightInfo. This is needed to do further
> > >>>> lazy
> > >>>> operations in Spark
> > >>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with a
> > >>>> different
> > >>>> argument. This returns the list endpoints on the parallel flight
> > >>>> server.
> > >>>> The endpoints are not available till data is ready to be fetched,
> > which
> > >>>> is
> > >>>> done after the schema but is needed before DoGet is called.
> > >>>> 3) call get stream on all endpoints from 2
&g

Re: Flight authentication interoperability

2019-07-04 Thread Ryan Murray
Hey David,

I was actually testing test_flight.test_http_basic_auth(), but I think the
same applies. The Java default implementation expects a handshake; more to
the point, it expects a BasicAuth protobuf, which I believe is not exposed
at all in Python. Always returning true in
BasicServerAuthHandler.authenticate() allows the Java and Python test
implementations of authentication to speak to each other.
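
Concretely, the hack I used to get the two test implementations talking is an
accept-everything handler along the lines below. It is a throwaway for
interop testing only, and the interface shape is the Java flight auth module
as I remember it, so treat it as a sketch rather than something to copy.

    // Throwaway handler for interop testing only: accept whatever the client
    // sends in the handshake (a BasicAuth protobuf from the Java client, raw
    // bytes from the Python tests) and hand back a dummy token.
    import java.nio.charset.StandardCharsets;
    import java.util.Iterator;
    import java.util.Optional;
    import org.apache.arrow.flight.auth.ServerAuthHandler;

    public class AcceptAllAuthHandler implements ServerAuthHandler {
      @Override
      public boolean authenticate(ServerAuthSender outgoing, Iterator<byte[]> incoming) {
        if (incoming.hasNext()) {
          incoming.next(); // don't try to decode the payload at all
        }
        outgoing.send("dummy-token".getBytes(StandardCharsets.UTF_8));
        return true;
      }

      @Override
      public Optional<String> isValid(byte[] token) {
        return Optional.of("anonymous"); // every token maps to the same identity
      }
    }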

Thanks for the link below; that really clarified things for me. I would add
to the list that we should normalise the use of the BasicAuth protobuf
message between Java and C++.

Apologies for the urgency yesterday; I am glad it was sorted, and it was
more my fault than Arrow's.

Best,
Ryan


On Thu, Jul 4, 2019 at 11:29 AM David Li  wrote:

> Hmm, interesting. I assume you mean test_flight.test_token_auth() as
> the client? The tests weren't written to be explicitly compatible, but
> there's no reason you should get an indefinite stall.
>
> We don't use Handshake/ServerAuthHandler#authenticate, so that would
> explain why we don't see issues. I commented on this in the initial
> implementation:
> https://github.com/apache/arrow/pull/4125#discussion_r273935691
>
> > there is not a 1-1 mapping between connected clients and connected
> peers, and so you can
> > only know the identity of the peer at the moment it makes a call. Just
> doing a handshake
> > (Authenticate) isn't enough to identify who is on the other end of a
> particular connection.
>
> the reason being that a layer 7 load balancer (e.g. Envoy) means one
> gRPC connection can represent multiple clients. Conversely,
> client-side load balancing (built into gRPC) means one client-side
> "connection" can actually represent multiple servers. Of course, you
> have to consciously deploy in this manner, so Handshake is still
> useful if you know you won't ever need these features.
>
> As I see it, this means there's a few things to work on:
> - Flight RPC feature compatibility needs to be tested, not just format
> compatibility.
> - We should start thinking about async APIs and/or timeouts in any
> sort of API that makes a network call (there's already a JIRA:
> https://issues.apache.org/jira/browse/ARROW-1009), since "never
> returns" is a terrible failure mode
>
> Best,
> David
>
> On 7/4/19, Ryan Murray  wrote:
> > Hey David,
> >
> > I am curious to see what you are doing different from me. I am running
> the
> > Java ExampleFlightServer.java against the python auth flight tests and
> they
> > are not passing. The particular issue is that incoming.next() never
> returns
> > in BasicServerAuthHandler.java:56
> >
> > It doesn't appear to be anything wrong w/ the auth piece specifically
> > rather the server appears to not be getting the auth info to verify. I am
> > still investigating my issue but I am glad that someone else has gotten
> > this to work.
> >
> > Best,
> > Ryan
> >
> > On Thu, Jul 4, 2019 at 9:13 AM Antoine Pitrou 
> wrote:
> >
> >>
> >> It may be worth opening a JIRA for the flaky tests if not already done.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> Le 04/07/2019 à 18:11, David Li a écrit :
> >> > I'm also curious as to what the issue was, as we've been doing
> >> > Python-client-Java-server auth with development builds without
> >> > trouble.
> >> >
> >> > Regardless - this does point out a need for more cross-language Flight
> >> > testing (perhaps a Flight-specific integration suite?), and to get
> >> > existing tests running more consistently in CI (Flight/Java in
> >> > particular has a lot of flaky tests, though the auth tests are enabled
> >> > in Travis).
> >> >
> >> > Best,
> >> > David
> >> >
> >> > On 7/4/19, Jacques Nadeau  wrote:
> >> >> Which is exactly why I was withholding a vote until there was more
> >> >> information.
> >> >>
> >> >> On Thu, Jul 4, 2019, 7:25 AM Antoine Pitrou 
> >> wrote:
> >> >>
> >> >>> On Thu, 4 Jul 2019 09:04:34 -0500
> >> >>> Wes McKinney  wrote:
> >> >>>>
> >> >>>> That being said, with Ryan's issue, he is using a feature
> >> >>>> (cross-language auth in Flight) that isn't being tested. The Flight
> >> >>>> integration tests do not use authentication AFAIK so I'm not
> >> >>>> surprised
> >> >>>> to hear that there may be an issue with it.
> >> >>

Re: Flight authentication interoperability

2019-07-04 Thread Ryan Murray
Hey David,

I am curious to see what you are doing differently from me. I am running
the Java ExampleFlightServer.java against the Python auth Flight tests and
they are not passing. The particular issue is that incoming.next() never
returns in BasicServerAuthHandler.java:56.

It doesn't appear to be anything wrong with the auth piece specifically;
rather, the server appears not to be getting the auth info to verify. I am
still investigating my issue, but I am glad that someone else has gotten
this to work.
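
To illustrate what I mean by "expects a handshake": with the Java client the
flow against the example server looks roughly like the sketch below (the
port and credentials are placeholders). The Python test client sends its
credentials in a different shape on the same handshake stream, which I
suspect is where the mismatch comes from.

    // Sketch of the Java-side handshake the example server is written for.
    import org.apache.arrow.flight.FlightClient;
    import org.apache.arrow.flight.Location;
    import org.apache.arrow.memory.RootAllocator;

    public class AuthHandshakeSketch {
      public static void main(String[] args) throws Exception {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FlightClient client = FlightClient.builder(
                 allocator, Location.forGrpcInsecure("localhost", 12233)).build()) {
          // Sends a BasicAuth protobuf as the first handshake payload and blocks
          // until the server's ServerAuthHandler.authenticate() returns a token.
          client.authenticateBasic("flight", "flight123");
          // Later RPCs attach the returned token automatically.
        }
      }
    }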

Best,
Ryan

On Thu, Jul 4, 2019 at 9:13 AM Antoine Pitrou  wrote:

>
> It may be worth opening a JIRA for the flaky tests if not already done.
>
> Regards
>
> Antoine.
>
>
> Le 04/07/2019 à 18:11, David Li a écrit :
> > I'm also curious as to what the issue was, as we've been doing
> > Python-client-Java-server auth with development builds without
> > trouble.
> >
> > Regardless - this does point out a need for more cross-language Flight
> > testing (perhaps a Flight-specific integration suite?), and to get
> > existing tests running more consistently in CI (Flight/Java in
> > particular has a lot of flaky tests, though the auth tests are enabled
> > in Travis).
> >
> > Best,
> > David
> >
> > On 7/4/19, Jacques Nadeau  wrote:
> >> Which is exactly why I was withholding a vote until there was more
> >> information.
> >>
> >> On Thu, Jul 4, 2019, 7:25 AM Antoine Pitrou 
> wrote:
> >>
> >>> On Thu, 4 Jul 2019 09:04:34 -0500
> >>> Wes McKinney  wrote:
> >>>>
> >>>> That being said, with Ryan's issue, he is using a feature
> >>>> (cross-language auth in Flight) that isn't being tested. The Flight
> >>>> integration tests do not use authentication AFAIK so I'm not surprised
> >>>> to hear that there may be an issue with it.
> >>>
> >>> OTOH, it's a bit unlikely.  Flight authentication is implemented is:
> >>> - the Arrow codebase simply passes opaque tokens around
> >>> - interpretation of tokens is handled by application code
> >>> - marshalling of tokens is handled by Protocol Buffers
> >>>
> >>> So unless something silly is going on (such as "passing an empty string
> >>> instead of the actual token") there's not much potential for
> >>> auth interoperability issues in the core Flight codebase.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>>
> >>
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>


Re: [VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-03 Thread Ryan Murray
irm
> >> >> > > if that's expected.
> >> >> >
> >> >> > I think that this is caused by "-P arrow-jni" is missing in
> >> >> > 01-perform.sh:
> >> >> >
> >> >> >   https://github.com/apache/arrow/pull/4717#issuecomment-506916189
> >> >> >
> >> >> > It's intentional for RC0.
> >> >> >
> >> >> > We'll fix this after RC0:
> >> >> >
> >> >> >   https://issues.apache.org/jira/browse/ARROW-5786
> >> >> >
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > kou
> >> >> >
> >> >> > In <
> >> capwbug6wvudwu7-z8dyhq7snusagappkdzkrqof4dfnj4np...@mail.gmail.com>
> >> >> >   "Re: [VOTE] Release Apache Arrow 0.14.0 - RC0" on Wed, 3 Jul 2019
> >> >> > 06:55:52 +0530,
> >> >> >   Ravindra Pindikura  wrote:
> >> >> >
> >> >> > > I tried "./dev/release/verify-release-candidate.sh source 0.14.0
> 0"
> >> on
> >> >> > mac
> >> >> > > mojave.
> >> >> > >
> >> >> > > 1. I consistently get this error with flight tests
> >> >> > >
> >> >> > > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time
> >> elapsed:
> >> >> > > 0.04 s <<< FAILURE! - in
> org.apache.arrow.flight.TestServerOptions
> >> >> > > [ERROR] domainSocket(org.apache.arrow.flight.TestServerOptions)
> >> Time
> >> >> > > elapsed: 0.04 s  <<< ERROR!
> >> >> > > java.io.IOException: Failed to bind
> >> >> > > at
> >> >> > >
> >> >> >
> >> >>
> >>
> org.apache.arrow.flight.TestServerOptions.domainSocket(TestServerOptions.java:46)
> >> >> > > Caused by: io.netty.channel.unix.Errors$NativeIoException:
> bind(..)
> >> >> > failed:
> >> >> > > Address already in use
> >> >> > >
> >> >> > > Is there a workaround or gotcha for this ?
> >> >> > >
> >> >> > > 2. The package doesn't seem to include gandiva
> >> >> > >
> >> >> > > is that intentional ? I'm fine if it is not included, just want
> to
> >> >> > confirm
> >> >> > > if that's expected.
> >> >> > >
> >> >> > > On Wed, Jul 3, 2019 at 6:37 AM Sutou Kouhei 
> >> >> wrote:
> >> >> > >
> >> >> > >> > I tried again (Ubuntu 18.04):
> >> >> > >> > * source verification failed in gRPC configure step:
> >> >> > >> > Problem is, Ubuntu's c-ares does not provide any CMake files:
> >> >> > >>
> >> >> > >> Note: Adding -Dc-ares_SOURCE=BUNDLED CMake option is
> >> >> > >> workaround. We can use bundled c-ares automatically by
> >> >> > >> requiring c-ares's CMake config:
> >> >> > >>
> >> >> > >>   https://github.com/apache/arrow/pull/4783
> >> >> > >>
> >> >> > >>
> >> >> > >> Thanks,
> >> >> > >> --
> >> >> > >> kou
> >> >> > >>
> >> >> > >> In <7a82e6be-f4a5-240b-389a-4cf9cd4fb...@python.org>
> >> >> > >>   "Re: [VOTE] Release Apache Arrow 0.14.0 - RC0" on Tue, 2 Jul
> 2019
> >> >> > >> 11:36:09 +0200,
> >> >> > >>   Antoine Pitrou  wrote:
> >> >> > >>
> >> >> > >> >
> >> >> > >> > I tried again (Ubuntu 18.04):
> >> >> > >> >
> >> >> > >> > * binaries verification succeeded
> >> >> > >> >
> >> >> > >> > * source verification failed in gRPC configure step:
> >> >> > >> >
> >> >> > >> > CMake Error at cmake/cares.cmake:38 (find_package):
> >> >> > >> >   Could not find a package configuration file provided by
> >> "c-ares"
> >> >> > with
> >> >> > >> any
> >> >> > >> >   of the following names:
> >> >> > >> >
> >> >> > >> > c-aresConfig.cmake
> >> >> > >> > c-ares-config.cmake
> >> >> > >> >
> >> >> > >> >   Add the installation prefix of "c-ares" to
> CMAKE_PREFIX_PATH or
> >> >> set
> >> >> > >> >   "c-ares_DIR" to a directory containing one of the above
> >> files.  If
> >> >> > >> "c-ares"
> >> >> > >> >   provides a separate development package or SDK, be sure it
> has
> >> >> been
> >> >> > >> >   installed.
> >> >> > >> > Call Stack (most recent call first):
> >> >> > >> >   CMakeLists.txt:139 (include)
> >> >> > >> >
> >> >> > >> >
> >> >> > >> > Problem is, Ubuntu's c-ares does not provide any CMake files:
> >> >> > >> >
> >> >> > >> > $ dpkg -L libc-ares-dev
> >> >> > >> > /.
> >> >> > >> > /usr
> >> >> > >> > /usr/include
> >> >> > >> > /usr/include/ares.h
> >> >> > >> > /usr/include/ares_build.h
> >> >> > >> > /usr/include/ares_dns.h
> >> >> > >> > /usr/include/ares_rules.h
> >> >> > >> > /usr/include/ares_version.h
> >> >> > >> > /usr/lib
> >> >> > >> > /usr/lib/x86_64-linux-gnu
> >> >> > >> > /usr/lib/x86_64-linux-gnu/libcares.a
> >> >> > >> > /usr/lib/x86_64-linux-gnu/pkgconfig
> >> >> > >> > /usr/lib/x86_64-linux-gnu/pkgconfig/libcares.pc
> >> >> > >> > /usr/share
> >> >> > >> > /usr/share/doc
> >> >> > >> > /usr/share/doc/libc-ares-dev
> >> >> > >> > /usr/share/doc/libc-ares-dev/NEWS.gz
> >> >> > >> > /usr/share/doc/libc-ares-dev/README.cares
> >> >> > >> > /usr/share/doc/libc-ares-dev/README.md
> >> >> > >> > /usr/share/doc/libc-ares-dev/copyright
> >> >> > >> > /usr/share/man
> >> >> > >> > /usr/share/man/man3
> >> >> > >> > [ snip man pages ]
> >> >> > >> > /usr/lib/x86_64-linux-gnu/libcares.so
> >> >> > >> >
> >> >> > >> >
> >> >> > >> > Regards
> >> >> > >> >
> >> >> > >> > Antoine.
> >> >> > >>
> >> >> > >
> >> >> > >
> >> >> > > --
> >> >> > > Thanks and regards,
> >> >> > > Ravindra.
> >> >> >
> >> >>
> >> >>
> >> >> --
> >> >> Thanks and regards,
> >> >> Ravindra.
> >> >>
> >>
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>


Spark and Arrow Flight

2019-06-28 Thread Ryan Murray
Hi All,

I have been working on building an Arrow Flight source for Spark. The goal
is for Spark to be able to use a group of Arrow Flight endpoints to pull a
dataset over in parallel.

I am unsure of the best model for the Spark <-> Flight conversation and
wanted to get your opinion on the best way to go.

I am breaking up the query from Spark to Flight into 3 parts (sketched in
code just after this list):
1) get the schema using GetFlightInfo. This is needed to do further lazy
operations in Spark.
2) get the endpoints by calling GetFlightInfo a second time with a different
argument. This returns the list of endpoints on the parallel Flight server.
The endpoints are not available until data is ready to be fetched; this
happens after the schema call but must happen before DoGet is called.
3) call GetStream (DoGet) on all endpoints from step 2.
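
Roughly, the three steps look like the sketch below from the Java client.
The command payloads and the single fixed location are placeholders for
illustration, and I'm ignoring per-endpoint locations for brevity.

    // Sketch of the three-step flow; placeholder commands/location, no error handling.
    import java.nio.charset.StandardCharsets;
    import org.apache.arrow.flight.FlightClient;
    import org.apache.arrow.flight.FlightDescriptor;
    import org.apache.arrow.flight.FlightEndpoint;
    import org.apache.arrow.flight.FlightInfo;
    import org.apache.arrow.flight.FlightStream;
    import org.apache.arrow.flight.Location;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.types.pojo.Schema;

    public class SparkFlightFlowSketch {
      public static void main(String[] args) throws Exception {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FlightClient client = FlightClient.builder(
                 allocator, Location.forGrpcInsecure("localhost", 47470)).build()) {

          // 1) schema-only GetFlightInfo, so Spark can build its lazy plan
          FlightInfo schemaOnly = client.getInfo(FlightDescriptor.command(
              "schema:SELECT * FROM t".getBytes(StandardCharsets.UTF_8)));
          Schema schema = schemaOnly.getSchema();
          System.out.println("planning with schema: " + schema);

          // 2) second GetFlightInfo once data is ready; this one carries the endpoints
          FlightInfo withEndpoints = client.getInfo(FlightDescriptor.command(
              "data:SELECT * FROM t".getBytes(StandardCharsets.UTF_8)));

          // 3) DoGet on every endpoint; in Spark each endpoint becomes a partition
          for (FlightEndpoint endpoint : withEndpoints.getEndpoints()) {
            try (FlightStream stream = client.getStream(endpoint.getTicket())) {
              while (stream.next()) {
                // hand stream.getRoot() over to Spark's columnar readers
              }
            }
          }
        }
      }
    }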

I think I have to do each step; however, I don't like having to call
GetFlightInfo twice, as it doesn't seem very elegant. I see a few options:
1) live with calling GetFlightInfo twice, with a custom bytes cmd to
differentiate the purpose of each call
2) add an argument to GetFlightInfo to tell it it's being called only for
the schema
3) add another RPC endpoint, i.e. GetSchema(FlightDescriptor), to return
just the Schema in question
4) use DoAction and wrap the expected FlightInfo in a Result

I am aware that 4 is probably the least disruptive, but I'm also not a fan,
as (to me) it implies performing an action on the server side. Suggestions
2 and 3 are larger changes, and I am reluctant to make them unless there is
a consensus here. None of these are great options, and I am wondering what
everyone thinks the best approach might be, particularly as I think this is
likely to come up in more applications than just Spark.

Best,
Ryan


[jira] [Created] (ARROW-5249) Java Flight client doesn't handle auth correctly in some cases

2019-05-02 Thread Ryan Murray (JIRA)
Ryan Murray created ARROW-5249:
--

 Summary: Java Flight client doesn't handle auth correctly in some 
cases
 Key: ARROW-5249
 URL: https://issues.apache.org/jira/browse/ARROW-5249
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ryan Murray


The Flight Client does not add the auth token to some async requests. The
interceptor doesn't get added when creating a new ClientCall.
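
For context, the general gRPC pattern involved is a client interceptor that
stamps the auth token onto the headers of every outgoing call; if the
interceptor is not applied when a ClientCall is created, that call goes out
unauthenticated. A generic sketch of the pattern (not Arrow's actual internal
code; header name and token format are placeholders) looks like:

    // Generic gRPC-Java pattern: attach a token header to every intercepted call.
    import io.grpc.CallOptions;
    import io.grpc.Channel;
    import io.grpc.ClientCall;
    import io.grpc.ClientInterceptor;
    import io.grpc.ForwardingClientCall.SimpleForwardingClientCall;
    import io.grpc.Metadata;
    import io.grpc.MethodDescriptor;

    public class AuthTokenInterceptor implements ClientInterceptor {
      private static final Metadata.Key<String> AUTH_HEADER =
          Metadata.Key.of("authorization", Metadata.ASCII_STRING_MARSHALLER);

      private final String token;

      public AuthTokenInterceptor(String token) {
        this.token = token;
      }

      @Override
      public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
          MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
        return new SimpleForwardingClientCall<ReqT, RespT>(next.newCall(method, callOptions)) {
          @Override
          public void start(Listener<RespT> responseListener, Metadata headers) {
            headers.put(AUTH_HEADER, "Bearer " + token);
            super.start(responseListener, headers);
          }
        };
      }
    }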



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Job posting

2018-06-27 Thread Ryan Murray
Hey All,

Apologies ahead of time for the potential spam. I have been working with
Jacques and co for about 6 months to bring Dremio to our front office
trading organisation at UBS. We are now expanding the team and I am looking
to hire some killer devs in London to work on Arrow, Gandiva and other
cutting edge toys.

For my money this is the most exciting team in any investment bank, and it's
the only one I know of where one can actually contribute both to open source
projects and to a front-office trading system. Here[1] is the link, or chat
with me for details.

Looking forward to showing off some of our work on this list very soon.

Best,
Ryan

[1]
https://jobs.ubs.com/TGnewUI/Search/home/HomeWithPreLoad?PageType=JobDetails=176009=25008=5012=1=5012=176009_5012=88648=ILINKEDCH=#jobDetails=176009_5012