[SQL] Why does a small two-source JDBC query take ~150-200ms with all optimizations (AQE, CBO, pushdown, Kryo, unsafe) enabled? (v3.4.0-SNAPSHOT)

2022-05-18 Thread Gavin Ray
I did some basic testing of multi-source queries with the most recent Spark: https://github.com/GavinRay97/spark-playground/blob/44a756acaee676a9b0c128466e4ab231a7df8d46/src/main/scala/Application.scala#L46-L115 The output of "spark.time()" surprised me: SELECT p.id, p.name, t.id, t.title FROM…

Spark sql join optimizations

2019-02-26 Thread Akhilanand
Hello, I recently noticed that Spark doesn't optimize joins when we apply a limit. Say we have payment.join(customer, Seq("customerId"), "left").limit(1).explain(true). Spark doesn't optimize it. > == Physical Plan == > CollectLimit 1 > +- *(5) Project [customerId#29, paymentId#28,…
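Part of the reason Spark is conservative here is that pushing a limit below a join is not semantics-preserving in general. A plain-collections sketch (no Spark, with invented data) of how limiting the inputs first can change the answer of join-then-limit:

```scala
// Invented data: payments and customers keyed by customerId.
val payment  = Seq((1, "p1"), (2, "p2"))
val customer = Seq((2, "c2"))

// Join, then limit: the correct answer has one row.
val joined = for {
  (pid, p) <- payment
  (cid, c) <- customer
  if pid == cid
} yield (pid, p, c)
println(joined.take(1)) // List((2,p2,c2))

// Limit, then join: taking one row from each side first yields nothing,
// because the rows that happened to survive the limit don't match.
val wrong = for {
  (pid, p) <- payment.take(1)
  (cid, c) <- customer.take(1)
  if pid == cid
} yield (pid, p, c)
println(wrong) // List()
```

The thread's example is a left outer join; if I recall correctly, later Spark versions do push a limit into the streamed side of outer joins via the Catalyst LimitPushDown rule, but no such rewrite is safe for inner joins in general.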

Re: If I pass raw SQL string to dataframe do I still get the Spark SQL optimizations?

2017-07-06 Thread ayan guha
> On Thu, Jul 6, 2017 at 5:28 PM, kant kodali <kanth...@gmail.com> wrote: > >> HI All, >> >> I am wondering If I pass a raw SQL string to dataframe do I still get the >> Spark SQL optimizations? why or why not? >> >> Thanks! >> > > -- Best Regards, Ayan Guha

Re: If I pass raw SQL string to dataframe do I still get the Spark SQL optimizations?

2017-07-06 Thread Michael Armbrust
It goes through the same optimization pipeline. More in this video <https://youtu.be/1a4pgYzeFwE?t=608>. On Thu, Jul 6, 2017 at 5:28 PM, kant kodali <kanth...@gmail.com> wrote: > HI All, > > I am wondering If I pass a raw SQL string to dataframe do I still get the > Spar…

If I pass raw SQL string to dataframe do I still get the Spark SQL optimizations?

2017-07-06 Thread kant kodali
Hi all, I am wondering: if I pass a raw SQL string to a dataframe, do I still get the Spark SQL optimizations? Why or why not? Thanks!

Re: Is there a list of missing optimizations for typed functions?

2017-02-27 Thread lihu
…a list. I think the area is in heavy development esp. optimizations for typed operations. There's a JIRA to somehow find out more on the behavior of Scala code (non-Column-based one from your list) but I've seen no activity in this area. That's why for now Column-based untyp…

Re: Disable Spark SQL Optimizations for unit tests

2017-02-26 Thread Stefan Ackermann
…{ c => if (castToInts.contains(c)) { dfIn(c).cast(IntegerType) } else { dfIn(c) } } dfIn.select(columns: _*) } As I consequently applied this to other similar functions, the unit tests went down from 60 to 18 minutes. Another way to break SQL optimizations was to jus…

Re: Is there a list of missing optimizations for typed functions?

2017-02-24 Thread Jacek Laskowski
Hi Justin, I have never seen such a list. I think the area is in heavy development esp. optimizations for typed operations. There's a JIRA to somehow find out more on the behavior of Scala code (non-Column-based one from your list) but I've seen no activity in this area. That's why for now…

Is there a list of missing optimizations for typed functions?

2017-02-22 Thread Justin Pihony
…ction, so I have two questions really: 1.) Is there a list of the methods that lose some of the optimizations that you get from non-functional methods? Is it any method that accepts a generic function? 2.) Is there any work to attempt reflection and gain some of these optimizations back? I couldn…

Disable Spark SQL Optimizations for unit tests

2017-02-11 Thread Stefan Ackermann
Hi, Can the Spark SQL optimizations be disabled somehow? In our project we started 4 weeks ago to write scala / spark / dataframe code. We currently have only around 10% of the planned project scope, and we are already waiting 10 (Spark 2.1.0, everything cached) to 30 (Spark 1.6, nothing cached…
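Later Spark versions added a supported switch for this. Since Spark 2.4, `spark.sql.optimizer.excludedRules` disables individual Catalyst optimizer rules by name (Spark silently keeps rules it considers non-excludable). A hedged sketch; the two rule names below are illustrative examples, and on the Spark 1.6/2.1 versions discussed in this thread the option does not exist:

```properties
# spark-defaults.conf (or pass via --conf on spark-submit); Spark 2.4+.
# Comma-separated, fully-qualified Catalyst rule names to exclude.
spark.sql.optimizer.excludedRules  org.apache.spark.sql.catalyst.optimizer.ConstantFolding,org.apache.spark.sql.catalyst.optimizer.NullPropagation
```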

Are ser/de optimizations relevant with Dataset API and Encoders ?

2016-06-19 Thread Amit Sela
With the RDD API, you could optimize shuffling data by making sure that bytes are shuffled instead of objects, using the appropriate ser/de mechanism before and after the shuffle. For example: before parallelize, transform to bytes using a dedicated serializer, parallelize, and immediately after…
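The pre-shuffle byte-packing idea can be sketched without Spark: convert each record to Array[Byte] with a dedicated serializer just before the shuffle boundary and decode just after it, so only raw bytes cross the wire. A minimal round-trip sketch using plain java.io serialization; a real job would use Kryo (or, with the Dataset API, let Encoders do this), and the Payment type is invented for illustration:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Illustrative record type; in Spark this would be the RDD element type.
// Scala case classes are Serializable by default.
case class Payment(id: Int, amount: Double)

// Encode a value to bytes before the shuffle boundary ...
def toBytes[T](value: T): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  try oos.writeObject(value) finally oos.close()
  bos.toByteArray
}

// ... and decode it right after the shuffle.
def fromBytes[T](bytes: Array[Byte]): T = {
  val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}

val original     = Payment(1, 42.5)
val roundTripped = fromBytes[Payment](toBytes(original))
println(roundTripped == original) // true
```

The design point being probed in the question is exactly this: with Datasets, the Encoder already keeps data in a binary (Tungsten) format, so the manual RDD-era trick above is largely redundant there.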

Spark 1.5.2 - are new Project Tungsten optimizations available on RDD as well?

2016-02-02 Thread Nirav Patel
Hi, I read the release notes and a few slideshares on the latest optimizations in the Spark 1.4 and 1.5 releases, part of which are optimizations from Project Tungsten. The docs say it uses sun.misc.Unsafe to convert the physical RDD structure into a byte array before shuffle for optimized GC and memory. My…

Re: Optimizations

2015-07-03 Thread Raghavendra Pandey
This is the basic design of Spark: it runs all actions in different stages... Not sure you can achieve what you're looking for. On Jul 3, 2015 12:43 PM, Marius Danciu marius.dan...@gmail.com wrote: Hi all, If I have something like: rdd.join(...).mapPartitionToPair(...) It looks like…

Re: Optimizations

2015-07-03 Thread Marius Danciu
…Friday, July 3, 2015 at 3:13 AM To: user Subject: Optimizations Hi all, If I have something like: rdd.join(...).mapPartitionToPair(...) It looks like mapPartitionToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage…

Optimizations

2015-07-03 Thread Marius Danciu
Hi all, If I have something like: rdd.join(...).mapPartitionToPair(...) It looks like mapPartitionToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage? ... such that each result partition after join is passed to the mapPartitionToPair…

Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
…query in front of any job. My assumption is that HotSpot optimization kicks in during the first read. Do you have any idea how to confirm/solve this performance problem? Thanks for advice! p.s. I have a billion HotSpot optimizations shown with -XX:+PrintCompilation but can not figure out what…
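The warmup workaround described here is a general JVM pattern, independent of Spark. A minimal sketch in plain Scala: exercise the hot code path enough times for HotSpot to JIT-compile it before the real run, and launch with -XX:+PrintCompilation to watch compilation events. The 10000 iteration count loosely mirrors HotSpot's default -XX:CompileThreshold and is an assumption, not a tuned value:

```scala
// A hot code path; in the thread this would be the per-row query logic.
def hotPath(xs: Array[Int]): Long = {
  var sum = 0L
  var i = 0
  while (i < xs.length) { sum += xs(i); i += 1 }
  sum
}

val data = Array.tabulate(1000)(identity) // 0, 1, ..., 999

// Warmup phase: results are discarded; only the side effect of
// triggering JIT compilation matters.
var warm = 0L
for (_ <- 1 to 10000) warm = hotPath(data)

// "Real" run: by now hotPath is likely compiled rather than interpreted.
val result = hotPath(data)
println(result) // 499500
```

For the PrintCompilation noise mentioned in the postscript: filtering the log for your own package names is usually the quickest way to separate relevant entries from JDK-internal ones.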

Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Sean Owen
…This problem is really annoying, because most of my Spark tasks contain just 1 SQL query and data processing, and to speed up my jobs I put a special warmup query in front of any job. My assumption is that HotSpot optimization kicks in during the first read. Do you have any idea how…

Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
…processing, and to speed up my jobs I put a special warmup query in front of any job. My assumption is that HotSpot optimization kicks in during the first read. Do you have any idea how to confirm/solve this performance problem? Thanks for advice! p.s. I have a billion HotSpot…

Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Guillaume Pitel
…is that HotSpot optimization kicks in during the first read. Do you have any idea how to confirm/solve this performance problem? Thanks for advice! p.s. I have a billion HotSpot optimizations shown with -XX:+PrintCompilation but can not figure out what are important and what…