I did some basic testing of multi-source queries with the most recent Spark:
https://github.com/GavinRay97/spark-playground/blob/44a756acaee676a9b0c128466e4ab231a7df8d46/src/main/scala/Application.scala#L46-L115
The output of "spark.time()" surprised me:
SELECT p.id, p.name, t.id, t.title
FROM
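For reference, a minimal sketch of the spark.time harness itself (spark.range
stands in for the multi-source join in the linked file; `spark` is an existing
SparkSession):

// spark.time runs the block, prints "Time taken: ... ms", and returns the
// block's result.
spark.time {
  spark.range(1000000L).selectExpr("sum(id)").show()
}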
Hello,

I recently noticed that Spark doesn't optimize joins when we limit the
result. Say we have:

payment.join(customer, Seq("customerId"), "left").limit(1).explain(true)

Spark doesn't push the limit below the join; the plan still computes the full
join before taking one row:

> == Physical Plan ==
> CollectLimit 1
> +- *(5) Project [customerId#29, paymentId#28,
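A hedged sketch of a workaround, assuming it is semantically acceptable to
take the row before joining (e.g. customerId is unique in customer, so the
left join cannot multiply rows):

// payment and customer are the DataFrames from the example above.
payment
  .limit(1)                                  // take one payment row first
  .join(customer, Seq("customerId"), "left") // the join sees only that row
  .explain(true)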
> On Thu, Jul 6, 2017 at 5:28 PM, kant kodali <kanth...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am wondering, if I pass a raw SQL string to a DataFrame, do I still get
>> the Spark SQL optimizations? Why or why not?
>>
>> Thanks!
>>
>
>
--
Best Regards,
Ayan Guha
It goes through the same optimization pipeline. More in this video
<https://youtu.be/1a4pgYzeFwE?t=608>.
On Thu, Jul 6, 2017 at 5:28 PM, kant kodali <kanth...@gmail.com> wrote:
> Hi All,
>
> I am wondering, if I pass a raw SQL string to a DataFrame, do I still get
> the Spark SQL optimizations? Why or why not?
Hi All,

I am wondering, if I pass a raw SQL string to a DataFrame, do I still get
the Spark SQL optimizations? Why or why not?

Thanks!
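A minimal sketch to verify the answer above: the raw SQL string and the
equivalent DataFrame code yield essentially the same optimized plan (names
and data are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.createOrReplaceTempView("t")

// Both go through the same Catalyst pipeline; compare the explain output.
spark.sql("SELECT id FROM t WHERE id > 1").explain(true)
df.where($"id" > 1).select($"id").explain(true)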
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.IntegerType

// The head of this function is cut off in the archive; the wrapper below is
// a reconstruction with an assumed name and signature. The point: Column.cast
// stays visible to the Catalyst optimizer, unlike a generic UDF.
def castSelectedToInt(dfIn: DataFrame, castToInts: Set[String]): DataFrame = {
  val columns = dfIn.columns.map { c =>
    if (castToInts.contains(c)) dfIn(c).cast(IntegerType)
    else dfIn(c)
  }
  dfIn.select(columns: _*)
}
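For contrast, a hypothetical reconstruction of the UDF-based variant this
replaced (the original is not shown in the thread); a generic UDF is opaque
to Catalyst, so optimizations are lost on those columns:

import org.apache.spark.sql.functions.udf

// Hypothetical slower variant: the lambda inside udf() is a black box to
// the optimizer, unlike Column.cast above.
val toInt = udf((s: String) => s.toInt)
val dfOut = dfIn.select(dfIn.columns.map { c =>
  if (castToInts.contains(c)) toInt(dfIn(c)).as(c) else dfIn(c)
}: _*)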
As I subsequently applied this to other similar functions, the unit tests
went down from 60 to 18 minutes.

Another way to break SQL optimizations was to just
Hi Justin,

I have never seen such a list. I think the area is in heavy development,
esp. optimizations for typed operations.

There's a JIRA to somehow find out more about the behavior of Scala code
(the non-Column-based kind from your list), but I've seen no activity in
this area. That's why, for now, Column-based untyped operations are
preferred.
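To illustrate the typed-vs-Column distinction being discussed (a sketch, not
from the original thread; assumes a local SparkSession):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("typed").getOrCreate()
import spark.implicits._

val ds = Seq(Person("a", 20), Person("b", 30)).toDS()

// Typed operation: the closure is opaque to Catalyst, so the predicate
// cannot be pushed down or constant-folded.
ds.filter(p => p.age > 21).explain(true)

// Column-based (untyped) operation: the predicate is a Catalyst expression
// and stays fully optimizable.
ds.filter($"age" > 21).explain(true)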
...function, so I have two questions really:

1.) Is there a list of the methods that lose some of the optimizations you
get from non-functional methods? Is it any method that accepts a generic
function?

2.) Is there any work to attempt reflection and gain some of these
optimizations back? I couldn't
Hi,

Can the Spark SQL optimizations be disabled somehow?

In our project we started 4 weeks ago to write Scala / Spark / DataFrame
code. We currently have only around 10% of the planned project scope, and we
are already waiting 10 (Spark 2.1.0, everything cached) to 30 (Spark 1.6,
nothing cached
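For reference, a hedged sketch of the knob that later Spark versions (2.4+)
added for excluding individual optimizer rules; it does not exist in the
1.6/2.1 versions mentioned above:

// Spark 2.4+ only: exclude a specific Catalyst rule by its fully qualified
// class name (illustrative choice of rule).
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")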
With the RDD API, you could optimize shuffles by making sure that bytes are
shuffled instead of objects, using an appropriate ser/de mechanism before and
after the shuffle. For example: before parallelize, transform to bytes using
a dedicated serializer; parallelize; and immediately after the shuffle,
deserialize back.
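A sketch of that pattern under illustrative names (MyRecord and the ser/de
helpers are not from the original message; `sc` is an existing SparkContext):

import java.io._

case class MyRecord(key: Int, payload: String)

// Dedicated ser/de so the shuffle moves plain byte arrays, not objects.
def ser(r: MyRecord): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(r); oos.close()
  bos.toByteArray
}
def de(b: Array[Byte]): MyRecord =
  new ObjectInputStream(new ByteArrayInputStream(b))
    .readObject().asInstanceOf[MyRecord]

val records = (1 to 1000).map(i => MyRecord(i % 10, s"v$i"))

val result = sc
  .parallelize(records.map(r => (r.key, ser(r)))) // bytes before the shuffle
  .groupByKey()                                   // shuffle moves raw bytes
  .mapValues(_.map(de))                           // deserialize right after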
Hi,

I read the release notes and a few slide decks about the latest
optimizations in the Spark 1.4 and 1.5 releases, part of which come from
Project Tungsten. The docs say it uses sun.misc.Unsafe to convert the
physical RDD structure into byte arrays before shuffle, for optimized GC and
memory use. My
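For intuition only, a sketch of what off-heap byte storage via sun.misc.Unsafe
looks like in general (this is not Spark's actual Tungsten code):

import sun.misc.Unsafe

// Unsafe is not publicly constructible; grab the singleton reflectively.
val f = classOf[Unsafe].getDeclaredField("theUnsafe")
f.setAccessible(true)
val unsafe = f.get(null).asInstanceOf[Unsafe]

val addr = unsafe.allocateMemory(8) // 8 off-heap bytes, invisible to the GC
unsafe.putLong(addr, 42L)
assert(unsafe.getLong(addr) == 42L)
unsafe.freeMemory(addr)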
This is the basic design of Spark: it runs the actions in different
stages... Not sure you can achieve what you're looking for.

On Jul 3, 2015 12:43 PM, Marius Danciu marius.dan...@gmail.com wrote:

Hi all,

If I have something like:

rdd.join(...).mapPartitionToPair(...)

It looks like mapPartitionToPair runs in a different stage than the join. Is
there a way to piggyback this computation inside the join stage? ... such
that each result partition after the join is passed to the mapPartitionToPair
function?
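A hedged sketch of one way to get that fusion explicitly: join is implemented
on top of cogroup, so the pairing and the per-record work can happen in a
single post-shuffle transformation (`sc` is an existing SparkContext; the
pair logic is illustrative):

val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
val right = sc.parallelize(Seq(1 -> 10, 2 -> 20))

// cogroup shuffles once; the flatMap below runs in the same reduce-side
// stage, producing the join pairs and any extra per-record work in one pass.
val fused = left.cogroup(right).mapPartitions(_.flatMap {
  case (k, (vs, ws)) =>
    for (v <- vs.iterator; w <- ws.iterator) yield (k, (v, w))
}, preservesPartitioning = true)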
This problem is really annoying, because most of my Spark tasks contain just
one SQL query plus data processing, and to speed up my jobs I put a special
warmup query in front of every job.

My assumption is that it is HotSpot JIT optimization that kicks in during
the first run. Do you have any idea how to confirm/solve this performance
problem?

Thanks for advice!

p.s. I have billions of HotSpot optimizations shown with -XX:+PrintCompilation
but cannot figure out which are important and which are not.
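For what it's worth, a sketch of the warmup pattern described above (the
actual warmup query is not shown in the thread; `realQuery` is a placeholder
for the job's real SQL string):

// Launch the JVM with -XX:+PrintCompilation to watch JIT activity. Running
// a cheap query over the same code paths first lets HotSpot compile the hot
// methods before the query you actually care about.
spark.time { spark.range(100000L).selectExpr("count(id)").show() } // warmup
spark.time { spark.sql(realQuery).show() }                         // measured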