Hey Wes,

Would be happy to! Jacques and I had originally thought to try and get it
into Spark, but perhaps Arrow might be a better home. I think the only issue
is whether we want to bring Spark jars and their dependencies into Arrow.
One challenge I have had so far with the connector is managing the
transitive Arrow dependencies from Spark: the connector only works on
relatively recent versions of Spark and can potentially create circular
Arrow dependencies. I think this issue will improve once 1.0.0 is done
and we can rely on a stable format/API.

Best,
Ryan

On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney <wesmck...@gmail.com> wrote:

> Hi Ryan, have you thought about developing this inside Apache Arrow?
>
> On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler <cutl...@gmail.com> wrote:
>
> > Great, thanks Ryan! I'll take a look
> >
> > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <rym...@dremio.com> wrote:
> >
> > > Hi Bryan,
> > >
> > > I have an implementation of option #3 nearly ready for a PR. I will
> > > mention you when I publish it.
> > >
> > > The working prototype for the Spark connector is here:
> > > https://github.com/rymurr/flight-spark-source. It technically works
> > > (and is very fast!); however, the implementation is pretty dodgy and
> > > needs to be cleaned up before it is ready for prime time. I plan to
> > > have it ready to go for the Arrow 1.0.0 release as an Apache 2.0
> > > licensed project. Please shout if you have any comments or are
> > > interested in contributing!
> > >
> > > Best,
> > > Ryan
> > >
> > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <cutl...@gmail.com> wrote:
> > >
> > > > I'm in favor of option #3 also, but not sure what the best thing to
> > > > do with the existing FlightInfo response is. I'm definitely
> > > > interested in connecting Spark with Flight. Can you share more
> > > > details of your work, or is it planned to be open sourced?
> > > >
> > > > Thanks,
> > > > Bryan
> > > >
> > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou <anto...@python.org> wrote:
> > > >
> > > > >
> > > > > Either #3 or #4 for me.  If #3, the default GetSchema
> > > > > implementation can rely on calling GetFlightInfo.
> > > > >
> > > > >
> > > > > On 01/07/2019 at 22:50, David Li wrote:
> > > > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > > > >
> > > > > > We've been thinking about a similar issue, where sometimes we
> > > > > > want just the schema, but the service can't necessarily return
> > > > > > the schema without fetching data - right now we return a sentinel
> > > > > > value in GetFlightInfo, but a separate RPC would let us
> > > > > > explicitly indicate an error.
> > > > > >
> > > > > > I might be missing something though - what happens between step 1
> > > > > > and 2 that makes the endpoints available? Would it make sense to
> > > > > > use DoAction to cause the backend to "prepare" the endpoints, and
> > > > > > have the result of that be an encoded schema? So then the flow
> > > > > > would be DoAction -> GetFlightInfo -> DoGet.
> > > > > >
> > > > > > Best,
> > > > > > David
> > > > > >
> > > > > > On 7/1/19, Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > >> My inclination is either #2 or #3. #4 is an option of course,
> > > > > >> but I like the more structured solution of explicitly requesting
> > > > > >> the schema given a descriptor.
> > > > > >>
> > > > > >> In both cases, it's possible that schemas are sent twice, e.g.
> > > > > >> if you call GetSchema and then later call GetFlightInfo and so
> > > > > >> you receive the schema again. The schema is optional, so if it
> > > > > >> became a performance problem then a particular server might
> > > > > >> return the schema as null from GetFlightInfo.
> > > > > >>
> > > > > >> I think it's valid to want to make a single GetFlightInfo RPC
> > > > > >> request that returns _both_ the schema and the query plan.
> > > > > >>
> > > > > >> Thoughts from others?
> > > > > >>
> > > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > >>>
> > > > > >>> My initial inclination is towards #3 but I'd be curious what
> > > > > >>> others think.
> > > > > >>> In the case of #3, I wonder if it makes sense to then pull the
> > > > > >>> Schema off the GetFlightInfo response...
> > > > > >>>
> > > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <rym...@dremio.com> wrote:
> > > > > >>>
> > > > > >>>> Hi All,
> > > > > >>>>
> > > > > >>>> I have been working on building an Arrow Flight source for
> > > > > >>>> Spark. The goal here is for Spark to be able to use a group of
> > > > > >>>> Arrow Flight endpoints to get a dataset pulled over to Spark in
> > > > > >>>> parallel.
> > > > > >>>>
> > > > > >>>> I am unsure of the best model for the Spark <-> Flight
> > > > > >>>> conversation and wanted to get your opinion on the best way to
> > > > > >>>> go.
> > > > > >>>>
> > > > > >>>> I am breaking up the query to Flight from Spark into 3 parts:
> > > > > >>>> 1) get the schema using GetFlightInfo. This is needed to do
> > > > > >>>> further lazy operations in Spark
> > > > > >>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with a
> > > > > >>>> different argument. This returns the list of endpoints on the
> > > > > >>>> parallel Flight server. The endpoints are not available till
> > > > > >>>> data is ready to be fetched, which is done after the schema but
> > > > > >>>> is needed before DoGet is called.
> > > > > >>>> 3) call get stream on all endpoints from 2
> > > > > >>>>
> > > > > >>>> I think I have to do each step; however, I don't like having to
> > > > > >>>> call getInfo twice, as it doesn't seem very elegant. I see a few
> > > > > >>>> options:
> > > > > >>>> 1) live with calling GetFlightInfo twice, with a custom bytes
> > > > > >>>> cmd to differentiate the purpose of each call
> > > > > >>>> 2) add an argument to GetFlightInfo to tell it it's being called
> > > > > >>>> only for the schema
> > > > > >>>> 3) add another RPC endpoint, i.e. GetSchema(FlightDescriptor),
> > > > > >>>> to return just the Schema in question
> > > > > >>>> 4) use DoAction and wrap the expected FlightInfo in a Result
> > > > > >>>>
> > > > > >>>> I am aware that 4 is probably the least disruptive but I'm also
> > > > > >>>> not a fan as (to me) it implies performing an action on the
> > > > > >>>> server side. Suggestions 2 & 3 are larger changes and I am
> > > > > >>>> reluctant to do that unless there is a consensus here. None of
> > > > > >>>> them are great options and I am wondering what everyone thinks
> > > > > >>>> the best approach might be, particularly as I think this is
> > > > > >>>> likely to come up in more applications than just Spark.
> > > > > >>>>
> > > > > >>>> Best,
> > > > > >>>> Ryan
> > > > > >>>>
> > > > > >>
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Ryan Murray  | Principal Consulting Engineer
> > >
> > > +447540852009 | rym...@dremio.com
> > >
> > > <https://www.dremio.com/>
> > > Check out our GitHub <https://www.github.com/dremio>, join our
> > > community site <https://community.dremio.com/> & Download Dremio
> > > <https://www.dremio.com/download>
> > >
> >
>
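
For anyone skimming the quoted thread above, here is a rough sketch of what
the flow might look like with option #3 (a separate GetSchema call), including
the suggestion that a default GetSchema can fall back to GetFlightInfo. It is
a toy, in-memory model only: FlightInfo, BaseFlightServer, InMemoryServer and
read_dataset are hypothetical names made up for illustration, not the real
Arrow Flight or Spark APIs, and column names stand in for an Arrow schema.

```python
# Toy model of the GetSchema -> GetFlightInfo -> DoGet flow.
# Every class and function here is hypothetical; this is not the Flight API.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FlightInfo:
    schema: Optional[List[str]]  # column names stand in for an Arrow schema
    endpoints: List[str]         # one ticket per parallel stream


class BaseFlightServer:
    def get_flight_info(self, descriptor: str) -> FlightInfo:
        raise NotImplementedError

    def get_schema(self, descriptor: str) -> List[str]:
        # Assumed default: fall back to GetFlightInfo, and raise an explicit
        # error instead of returning a sentinel when no schema is available.
        schema = self.get_flight_info(descriptor).schema
        if schema is None:
            raise LookupError(f"no schema available for {descriptor!r}")
        return schema

    def do_get(self, ticket: str) -> List[Tuple]:
        raise NotImplementedError


class InMemoryServer(BaseFlightServer):
    def __init__(self, schema: Optional[List[str]],
                 partitions: Dict[str, List[Tuple]]):
        self._schema = schema
        self._partitions = partitions

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        return FlightInfo(self._schema, sorted(self._partitions))

    def do_get(self, ticket: str) -> List[Tuple]:
        return self._partitions[ticket]


def read_dataset(server: BaseFlightServer, descriptor: str):
    schema = server.get_schema(descriptor)                    # step 1: schema only
    endpoints = server.get_flight_info(descriptor).endpoints  # step 2: endpoints
    with ThreadPoolExecutor() as pool:                        # step 3: parallel fetch
        parts = list(pool.map(server.do_get, endpoints))
    rows = [row for part in parts for row in part]
    return schema, rows


server = InMemoryServer(
    ["id", "name"],
    {"part-0": [(1, "a"), (2, "b")], "part-1": [(3, "c")]},
)
schema, rows = read_dataset(server, "demo")
```

A server that can produce a schema without preparing endpoints could override
get_schema directly; the fallback above is just the simplest default.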


