Hi Subash,

I'm only familiar with Question 1. Spark only uses Arrow to accelerate Python and R UDF evaluation and to move data to and from those language APIs (see our blog posts for some discussion of this). So my guess is that for what you're describing there are no speedups, unless there's new development I haven't heard of. There is some internal columnar processing in Spark, but I don't know whether Arrow is used there (there was some discussion of this, but I'm not sure where things currently stand).
On the second question I'd have to defer to others who know better.

Thanks,
Wes

On Sun, Feb 16, 2020 at 7:53 AM Subash Prabakar <subashpraba...@gmail.com> wrote:
>
> Hi all,
>
> I understand the use of Arrow in our projects for interoperability as
> well as faster access. I have a couple of questions about how we could
> use it for the following use cases, and whether that would be a good fit:
>
> 1. Will Spark execution be faster when I use joins on a DataFrame with
> Arrow compared to the normal Parquet format? Is the shuffle cost lower
> because of cheaper serialization and deserialization?
>
> 2. If I have a use case of running aggregate queries on a very large
> table (say 10 TB) containing a few dimensions and very few metrics, is
> it a good idea to use Arrow as an intermediate caching layer for
> interactive (low-latency) queries?
> Note: Dremio does this by default; should I explore it, or Impala or
> Drill, for this use case?
>
> Thanks,
> Subash