Hi Wes and Liya, Appreciate your feedback and information.
Looking forward to a more efficient integration between Arrow and Spark on the Java/Scala level. I would like to make my contribution if I can help in any way during my free time. Thank you very much. *Best Regards,WANG GAOXIANG* * (Eric) * National University of Singapore Graduate :: API Craft Singapore Co-organiser :: Singapore Python User Group Co-organiser *+6597685360 (P) :: wgx...@gmail.com <wgx...@gmail.com> (E) :: **https://medium.com/@wgx731 <https://medium.com/@wgx731> **(W)* On Fri, Dec 6, 2019 at 6:17 PM Fan Liya <liya.fa...@gmail.com> wrote: > Hi folks, > > Thanks for your clarification. > > I also think this is a universal requirement (including Java UDF in Arrow > format). > > The Java converter provided by Spark is inefficient, due to two reasons > (IMO) > > 1. There are frequent memory copies between on-heap and off-heap memory. > 2. The Spark API is in a row-oriented view (Iterator of InternalRow), so we > need to perform some column/row conversion, and we cannot copy data in > batch. > > To solve the problem, maybe we need something equivalent to pandas in Java > (I think pandas acts as a bridge between PyArrow and PySpark). > In addition, we need to integrate it in Arrow and Spark. > > Best, > Liya Fan > > On Fri, Dec 6, 2019 at 2:14 AM Chen Li <c...@fb.com> wrote: > > > We have a similar use case, and we use ArrowConverters.scala mentioned by > > Wes. However, the overhead of the conversion is kinda high. > > ------------------------------ > > *From:* Wes McKinney <wesmck...@gmail.com> > > *Sent:* Thursday, December 5, 2019 6:53 AM > > *To:* dev <dev@arrow.apache.org> > > *Cc:* Fan Liya <liya.fa...@gmail.com>; > > jeetendra.jais...@impetus.co.in.invalid > > <jeetendra.jais...@impetus.co.in.invalid> > > *Subject:* Re: Java - Spark dataframe to Arrow format > > > > hi folks, > > > > I understand the question to be about serialization. > > > > see > > > > * > > > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java > > * > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala > > * > > > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala > > > > This code is used to convert between Spark Data Frames and Arrow > > columnar format for UDF evaluation purposes > > > > On Thu, Dec 5, 2019 at 6:58 AM GaoXiang Wang <wgx...@gmail.com> wrote: > > > > > > Hi Jeetendra and Liya, > > > > > > I am actually having a similar use case. We have some data stored as > > *parquet > > > format in HDFS* and would like to make use of Apache Arrow to improve > > > compute performance if possible. Right now, I didn't see there is a > > direct > > > way to do in Java with Spark. > > > > > > I have search the Spark documentation, it looks like python support is > > > added after 2.3.0 ( > > > > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__spark.apache.org_docs_latest_sql-2Dpyspark-2Dpandas-2Dwith-2Darrow.html&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=JX5y-LzqAOulZIcSbMRGYA&m=E587baUuoFDKcpKDDIF4Su8nBHs0I9bGTBqEhtErCuY&s=n28eUF_7egcwK6LLh63Wra3oWTzZWBlB6en3xCxDEdE&e= > > ), > > > any plan from Apache Arrow team to provide *Spark integration for > Java*? > > > > > > Thank you very much. > > > > > > > > > *Best Regards,WANG GAOXIANG* > > > * (Eric) * > > > National University of Singapore Graduate :: > > > API Craft Singapore Co-organiser :: > > > Singapore Python User Group Co-organiser > > > *+6597685360 (P) :: wgx...@gmail.com <wgx...@gmail.com> (E) :: > > > ** > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__medium.com_-40wgx731&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=JX5y-LzqAOulZIcSbMRGYA&m=E587baUuoFDKcpKDDIF4Su8nBHs0I9bGTBqEhtErCuY&s=thoJd3JhOJ8HBCsAJTzhnfw91reStRfH0pUj9v-v5xE&e= > > > < > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__medium.com_-40wgx731&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=JX5y-LzqAOulZIcSbMRGYA&m=E587baUuoFDKcpKDDIF4Su8nBHs0I9bGTBqEhtErCuY&s=thoJd3JhOJ8HBCsAJTzhnfw91reStRfH0pUj9v-v5xE&e= > > > **(W)* > > > > > > > > > On Thu, Dec 5, 2019 at 6:58 PM Fan Liya <liya.fa...@gmail.com> wrote: > > > > > > > Hi Jeetendra, > > > > > > > > I am not sure if I understand your question correctly. > > > > > > > > Arrow is an in-memory columnar data format, and Spark has its own > > in-memory > > > > data format for DataFrame, which is invisible to end users. > > > > So the Spark user has no control over the underlying in-memory > layout. > > > > > > > > If you really want to convert a DataFrame into Arrow format, maybe > you > > can > > > > save the results of a Spark job to some external store (e.g. in ORC > > > > format), and then load it back to memory in Arrow format (if this is > > what > > > > you want). > > > > > > > > Best, > > > > Liya Fan > > > > > > > > On Thu, Dec 5, 2019 at 5:53 PM Jeetendra Kumar Jaiswal > > > > <jeetendra.jais...@impetus.co.in.invalid> wrote: > > > > > > > > > Hi Dev Team, > > > > > > > > > > Can someone please let me know how to convert spark data frame to > > Arrow > > > > > format. I am coding in Java. > > > > > > > > > > Java documentation of Arrow just has function API information. It > is > > > > > little hard to develop without proper documentation. > > > > > > > > > > Is there a way to directly convert spark dataframe to Arrow format > > > > > dataframes. > > > > > > > > > > Thanks, > > > > > Jeetendra > > > > > > > > > > ________________________________ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > NOTE: This message may contain information that is confidential, > > > > > proprietary, privileged or otherwise protected by law. The message > is > > > > > intended solely for the named addressee. If received in error, > please > > > > > destroy and notify the sender. Any use of this email is prohibited > > when > > > > > received in error. Impetus does not represent, warrant and/or > > guarantee, > > > > > that the integrity of this communication has been maintained nor > > that the > > > > > communication is free of errors, virus, interception or > interference. > > > > > > > > > > > >