> Looking at what's there, it's exposing the entire result of do_get() to
> Spark as one partition, so we'll need to change it to expose each
> VectorSchemaRoot as one partition instead.
Do you mean for each batch returned by do_get? I don't think that is
possible unless you want all of the data to be in memory. For Spark you
need to tell it up front how many partitions a DataFrame has. The only
indication you get of this is the number of Tickets returned by
GetFlightInfo, unless you were to retrieve all batches first. As far as I
know you can't dynamically resize partitions as you get more information.
I'm not sure I understand what you are trying to accomplish exactly
(again, not an expert here, so starting a thread on the Spark mailing list
probably makes sense).

Cheers,
Micah

On Fri, Mar 25, 2022 at 2:04 PM James Duong <[email protected]> wrote:

> Thanks Micah. Yes, it's for PySpark.
>
> Looking at what's there, it's exposing the entire result of do_get() to
> Spark as one partition, so we'll need to change it to expose each
> VectorSchemaRoot as one partition instead.
>
> On Fri, Mar 25, 2022 at 12:31 PM Micah Kornfield <[email protected]>
> wrote:
>
>> This is for PySpark?
>>
>> This might be a better question for the Spark mailing list for the
>> mechanics, but I think the only way to do this efficiently is going
>> through the DataSource/DataSourceV2 APIs in Scala/Java. I think Ryan
>> Murray prototyped something along this path a while ago [1], but I'm
>> not sure of its current state.
>>
>> [1]
>> https://github.com/rymurr/flight-spark-source/blob/master/src/main/java/org/apache/arrow/flight/spark/FlightDataReader.java
>>
>> On Fri, Mar 25, 2022 at 12:01 PM James Duong <[email protected]>
>> wrote:
>>
>>> Most of the examples I've seen for loading data into Spark convert a
>>> FlightStreamReader to pandas using to_pandas().
>>>
>>> However, this seems to load the entire contents of the stream into
>>> memory. Is there an easy way to load individual chunks from the
>>> FlightStream into dataframes? I figure you could call get_chunk() to
>>> get a record batch and then convert the batch to a dataframe.
>>>
>>> --
>>> *James Duong*
>>> Lead Software Developer
>>> Bit Quill Technologies Inc.
>>> Direct: +1.604.562.6082 | [email protected]
>>> https://www.bitquilltech.com
>>>
>>> This email message is for the sole use of the intended recipient(s)
>>> and may contain confidential and privileged information. Any
>>> unauthorized review, use, disclosure, or distribution is prohibited.
>>> If you are not the intended recipient, please contact the sender by
>>> reply email and destroy all copies of the original message. Thank you.
