Re: Java - Spark dataframe to Arrow format

GaoXiang Wang Fri, 06 Dec 2019 05:11:47 -0800

Hi Wes and Liya,

Appreciate your feedback and information.


Looking forward to a more efficient integration between Arrow and Spark on
the Java/Scala level. I would like to make my contribution if I can help in
any way during my free time.

Thank you very much.


*Best Regards,WANG GAOXIANG*
* (Eric) *
National University of Singapore Graduate ::
API Craft Singapore Co-organiser ::
Singapore Python User Group Co-organiser
*+6597685360 (P) :: [email protected] <[email protected]> (E) ::
**https://medium.com/@wgx731
<https://medium.com/@wgx731> **(W)*


On Fri, Dec 6, 2019 at 6:17 PM Fan Liya <[email protected]> wrote:

> Hi folks,
>
> Thanks for your clarification.
>
> I also think this is a universal requirement (including Java UDF in Arrow
> format).
>
> The Java converter provided by Spark is inefficient, due to two reasons
> (IMO)
>
> 1. There are frequent memory copies between on-heap and off-heap memory.
> 2. The Spark API is in a row-oriented view (Iterator of InternalRow), so we
> need to perform some column/row conversion, and we cannot copy data in
> batch.
>
> To solve the problem, maybe we need something equivalent to pandas in Java
> (I think pandas acts as a bridge between PyArrow and PySpark).
> In addition, we need to integrate it in Arrow and Spark.
>
> Best,
> Liya Fan
>
> On Fri, Dec 6, 2019 at 2:14 AM Chen Li <[email protected]> wrote:
>
> > We have a similar use case, and we use ArrowConverters.scala mentioned by
> > Wes. However, the overhead of the conversion is kinda high.
> > ------------------------------
> > *From:* Wes McKinney <[email protected]>
> > *Sent:* Thursday, December 5, 2019 6:53 AM
> > *To:* dev <[email protected]>
> > *Cc:* Fan Liya <[email protected]>;
> > [email protected]
> > <[email protected]>
> > *Subject:* Re: Java - Spark dataframe to Arrow format
> >
> > hi folks,
> >
> > I understand the question to be about serialization.
> >
> > see
> >
> > *
> >
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
> > *
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
> > *
> >
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala
> >
> > This code is used to convert between Spark Data Frames and Arrow
> > columnar format for UDF evaluation purposes
> >
> > On Thu, Dec 5, 2019 at 6:58 AM GaoXiang Wang <[email protected]> wrote:
> > >
> > > Hi Jeetendra and Liya,
> > >
> > > I am actually having a similar use case. We have some data stored as
> > *parquet
> > > format in HDFS* and would like to make use of Apache Arrow to improve
> > > compute performance if possible. Right now, I didn't see there is a
> > direct
> > > way to do in Java with Spark.
> > >
> > > I have search the Spark documentation, it looks like python support is
> > > added after 2.3.0 (
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__spark.apache.org_docs_latest_sql-2Dpyspark-2Dpandas-2Dwith-2Darrow.html&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=JX5y-LzqAOulZIcSbMRGYA&m=E587baUuoFDKcpKDDIF4Su8nBHs0I9bGTBqEhtErCuY&s=n28eUF_7egcwK6LLh63Wra3oWTzZWBlB6en3xCxDEdE&e=
> > ),
> > > any plan from Apache Arrow team to provide *Spark integration for
> Java*?
> > >
> > > Thank you very much.
> > >
> > >
> > > *Best Regards,WANG GAOXIANG*
> > > * (Eric) *
> > > National University of Singapore Graduate ::
> > > API Craft Singapore Co-organiser ::
> > > Singapore Python User Group Co-organiser
> > > *+6597685360 (P) :: [email protected] <[email protected]> (E) ::
> > > **
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__medium.com_-40wgx731&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=JX5y-LzqAOulZIcSbMRGYA&m=E587baUuoFDKcpKDDIF4Su8nBHs0I9bGTBqEhtErCuY&s=thoJd3JhOJ8HBCsAJTzhnfw91reStRfH0pUj9v-v5xE&e=
> > > <
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__medium.com_-40wgx731&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=JX5y-LzqAOulZIcSbMRGYA&m=E587baUuoFDKcpKDDIF4Su8nBHs0I9bGTBqEhtErCuY&s=thoJd3JhOJ8HBCsAJTzhnfw91reStRfH0pUj9v-v5xE&e=
> > > **(W)*
> > >
> > >
> > > On Thu, Dec 5, 2019 at 6:58 PM Fan Liya <[email protected]> wrote:
> > >
> > > > Hi Jeetendra,
> > > >
> > > > I am not sure if I understand your question correctly.
> > > >
> > > > Arrow is an in-memory columnar data format, and Spark has its own
> > in-memory
> > > > data format for DataFrame, which is invisible to end users.
> > > > So the Spark user has no control over the underlying in-memory
> layout.
> > > >
> > > > If you really want to convert a DataFrame into Arrow format, maybe
> you
> > can
> > > > save the results of a Spark job to some external store (e.g. in ORC
> > > > format), and then load it back to memory in Arrow format (if this is
> > what
> > > > you want).
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Thu, Dec 5, 2019 at 5:53 PM Jeetendra Kumar Jaiswal
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi Dev Team,
> > > > >
> > > > > Can someone please let me know how to convert spark data frame to
> > Arrow
> > > > > format. I am coding in Java.
> > > > >
> > > > > Java documentation of Arrow just has function API information. It
> is
> > > > > little hard to develop without proper documentation.
> > > > >
> > > > > Is there a way to directly convert spark dataframe to Arrow format
> > > > > dataframes.
> > > > >
> > > > > Thanks,
> > > > > Jeetendra
> > > > >
> > > > > ________________________________
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > NOTE: This message may contain information that is confidential,
> > > > > proprietary, privileged or otherwise protected by law. The message
> is
> > > > > intended solely for the named addressee. If received in error,
> please
> > > > > destroy and notify the sender. Any use of this email is prohibited
> > when
> > > > > received in error. Impetus does not represent, warrant and/or
> > guarantee,
> > > > > that the integrity of this communication has been maintained nor
> > that the
> > > > > communication is free of errors, virus, interception or
> interference.
> > > > >
> > > >
> >
>

Re: Java - Spark dataframe to Arrow format

Reply via email to