Hi Subash,

I'm only familiar with Question 1. Spark only makes use of Arrow for
accelerating Python and R UDF evaluation and for sending data to and
from those language APIs (see our blog posts for some discussion of
this). So for the use case you describe, I would guess there aren't
any speedups unless there has been new development I haven't heard of.
There is some internal columnar processing in Spark, but I don't know
if Arrow is being used there (there was some discussion of this, but
I'm not sure where things currently stand).
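
To make that concrete, here is a minimal sketch of where Arrow does
come into play in Spark today: a scalar pandas UDF in PySpark, where
Arrow moves columnar batches between the JVM and the Python workers.
The app name, column names, and sample values are made up for
illustration, and the config key shown is the Spark 2.x one.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = (SparkSession.builder
             .appName("arrow-udf-sketch")
             # Arrow-based columnar transfer between the JVM and Python workers
             .config("spark.sql.execution.arrow.enabled", "true")
             .getOrCreate())

    df = spark.createDataFrame([(1, 2.0), (2, 3.5)], ["id", "value"])

    # The Series arrives as pandas data converted from Arrow record batches
    @pandas_udf("double", PandasUDFType.SCALAR)
    def times_two(v):
        return v * 2.0

    df.select(times_two(df["value"])).show()

Joins, shuffles, and Parquet scans on the JVM side don't go through
this path, which is why I wouldn't expect a speedup there.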

On the second question, I would have to defer to others who know better.

Thanks
Wes

On Sun, Feb 16, 2020 at 7:53 AM Subash Prabakar
<subashpraba...@gmail.com> wrote:
>
> Hi all,
>
> I understand the use of Arrow in our projects for interoperability as
> well as faster access. I have a couple of questions on how we can use it
> for the following use cases and whether this is a good way to use it:
>
> 1. Will Spark execution be faster when I use joins on a DataFrame with
> Arrow compared to the normal Parquet format? Is the shuffling cost lower
> because of reduced serialization and deserialization?
>
>
> 2. If I have a use case of running aggregate queries on a very large table
> (say 10 TB) containing a few dimensions and very few metrics, is it a good
> idea to use Arrow as an intermediate caching layer for interactive,
> low-latency queries?
> Note: Dremio provides this by default. Should I explore it, or Impala or
> Drill, for this use case?
>
>
> Thanks,
> Subash
