> Looking at what's there, it's exposing the entire result of do_get() to
> Spark as one partition, so we'll need to change it to expose each
> VectorSchemaRoot as one partition instead.
Do you mean for each batch returned by do_get? I don't think that is
possible unless you want all of the data to be in memory. For Spark you
need to tell it up front how many partitions a DataFrame has. The only
indication you get of this is the number of Tickets returned by
GetFlightInfo, unless you were to retrieve all batches first. As far as I
know you can't dynamically resize partitions as you get more information.
I'm not sure I understand what you are trying to accomplish exactly
(again, not an expert here, so starting a thread on the Spark mailing list
probably makes sense).

Cheers,
Micah

On Fri, Mar 25, 2022 at 2:04 PM James Duong <[email protected]> wrote:

> Thanks Micah. Yes, it's for PySpark.
>
> Looking at what's there, it's exposing the entire result of do_get() to
> Spark as one partition, so we'll need to change it to expose each
> VectorSchemaRoot as one partition instead.
>
> On Fri, Mar 25, 2022 at 12:31 PM Micah Kornfield <[email protected]>
> wrote:
>
>> This is for PySpark?
>>
>> This might be a better question for the Spark mailing list for the
>> mechanics, but I think the only way to do this efficiently is going
>> through the DataSource/DataSourceV2 APIs in Scala/Java. I think Ryan
>> Murray prototyped something along this path a while ago [1], but I'm
>> not sure of its current state.
>>
>> [1]
>> https://github.com/rymurr/flight-spark-source/blob/master/src/main/java/org/apache/arrow/flight/spark/FlightDataReader.java
>>
>> On Fri, Mar 25, 2022 at 12:01 PM James Duong <[email protected]>
>> wrote:
>>
>>> Most of the examples I've seen for loading data into Spark convert a
>>> FlightStreamReader to pandas using to_pandas().
>>>
>>> However, this seems to load the entire contents of the stream into
>>> memory. Is there an easy way to load individual chunks from the
>>> FlightStream into dataframes? I figure you could call get_chunk() to
>>> get a record batch and then convert the batch to a dataframe.
>>>
>>> --
>>> *James Duong*
>>> Lead Software Developer
>>> Bit Quill Technologies Inc.
>>> Direct: +1.604.562.6082 | [email protected]
>>> https://www.bitquilltech.com
>>>
>>> This email message is for the sole use of the intended recipient(s)
>>> and may contain confidential and privileged information. Any
>>> unauthorized review, use, disclosure, or distribution is prohibited.
>>> If you are not the intended recipient, please contact the sender by
>>> reply email and destroy all copies of the original message. Thank you.
