Join selection

2019-03-04 Thread Akhilanand
Hello, I was going through the Spark strategies class and found that, by default, sort-merge join is preferred over shuffled hash join. preferSortMergeJoin has to be explicitly set to false to force a shuffled hash join. 1) Why is sort-merge join preferred over hash join? 2) are

Spark sql join optimizations

2019-02-26 Thread Akhilanand
Hello, I recently noticed that Spark doesn't optimize joins when a limit is applied to them. For example, given payment.join(customer, Seq("customerId"), "left").limit(1).explain(true), Spark doesn't optimize the join. > == Physical Plan == > CollectLimit 1 > +- *(5) Project [customerId#29, paymentId#28,

Difference between Typed and untyped transformation in dataset API

2019-02-21 Thread Akhilanand
What is the key difference between typed and untyped transformations in the Dataset API? How do I determine whether a transformation is typed or untyped? Are there any gotchas about when to use which, beyond whether it does the job for me?

Re: Difference between dataset and dataframe

2019-02-18 Thread Akhilanand
. in general if you use > Dataset you miss out on some optimizations. also Encoders are not very > pleasant to work with. > >> On Mon, Feb 18, 2019 at 9:09 PM Akhilanand wrote: >> >> Hello, >> >> I have been recently exploring about dataset and datafram

Difference between dataset and dataframe

2019-02-18 Thread Akhilanand
couldn’t find anything that tells it specifically. If it's just for datasets, does that mean we miss out on the Project Tungsten optimisation for dataframes? Regards, Akhilanand BV