Hi,
I am also using Spark on the Hive Metastore. The performance is much better,
especially for larger datasets. I have the feeling that performance is better if
I load the data into DataFrames and join them there instead of doing the join
directly in Spark SQL, but I can't explain why yet.
Anybody experienced the same?
Been a while, but I remember reading on Stack Overflow that you can use a UDF as
a join condition to trick Catalyst into not reshuffling the partitions, i.e.
use regular equality on the column you partitioned or bucketed by and your
custom comparer for the other columns. Never got around to trying it out,
though.
Hi
I would first and foremost try to identify where most of the time is spent
during the query. One possibility is that it just takes ramp-up time for
executors to become available; if that's the case, then a dedicated YARN
queue may help, or using the Spark Thrift Server may help.
On Sun, Mar 15, 2020 at 11:
Mostly the concern is the reshuffle. Even though all the DataFrames are
partitioned by the same column, the join still reshuffles them; that is the
bottleneck as of now in our POC implementation.
Is there any way to tell Spark to keep all partitions with the same partition
key in the same place so that during the join no shuffle is needed?
Did you only partition, or also bucket by the join column? Are ORC indexes
active, i.e. are the JOIN keys sorted when writing the files?
Best,
Georg
Am So., 15. März 2020 um 15:52 Uhr schrieb Manjunath Shetty H <
manjunathshe...@live.com>:
> Mostly the concern is the reshuffle. Even though all the DF
Only partitioned, and the join keys are not sorted because those files are
written incrementally by batch jobs.
From: Georg Heiler
Sent: Sunday, March 15, 2020 8:30:53 PM
To: Manjunath Shetty H
Cc: ayan guha ; Magnus Nilsson ; user
Subject: Re: Optimising multiple hive table
Hi, Reynold.
Please see the following for the context.
https://issues.apache.org/jira/browse/SPARK-31136
"Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
syntax"
I raised the above issue according to the new rubric, and the banning was
the proposed alternative to reduce th
Are we sure "not padding" is "incorrect"?
I don't know whether ANSI SQL actually requires padding, but plenty of
databases don't actually pad.
https://docs.snowflake.net/manuals/sql-reference/data-types-text.html
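The padding question above can be illustrated with plain Python strings: CHAR(n) semantics right-pad values with spaces to the declared length, while VARCHAR(n) stores them as-is, which is why the two disagree on trailing-blank comparisons (the helper names here are invented for illustration):

```python
def char_store(value: str, n: int) -> str:
    # CHAR(n) semantics: right-pad with spaces to exactly n characters.
    return value.ljust(n)

def varchar_store(value: str, n: int) -> str:
    # VARCHAR(n) semantics: keep the value as-is, up to n characters.
    return value[:n]

# "ab" stored as CHAR(5) compares unequal to the same value in VARCHAR(5)
# unless trailing blanks are stripped first.
padded = char_store("ab", 5)      # "ab   "
unpadded = varchar_store("ab", 5)  # "ab"
```

Databases that "don't actually pad" effectively treat CHAR like VARCHAR here, which is the behavior under discussion.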
Hi,
100% agree with Reynold.
Regards,
Gourav Sengupta
On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin wrote:
> Are we sure "not padding" is "incorrect"?
>
> I don't know whether ANSI SQL actually requires padding, but plenty of
> databases don't actually pad.
>
> https://docs.snowflake.net/manual