Re: FYI: The evolution on `CHAR` type behavior

2020-03-15 Thread Reynold Xin
Are we sure "not padding" is "incorrect"? I don't know whether ANSI SQL actually requires padding, but plenty of databases don't actually pad. https://docs.snowflake.net/manuals/sql-reference/data-types-text.html
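To make the padding question concrete, here is a pure-Python sketch (not Spark or Snowflake code; function names are illustrative) of the ANSI-style behavior under discussion: `CHAR(n)` blank-pads on write, and comparisons ignore trailing blanks, whereas a "no padding" store keeps the original length.

```python
# Illustrative sketch of ANSI-style CHAR(n) semantics vs. a "no padding" store.
# All names here are made up for the example.

def char_store(value: str, n: int) -> str:
    """Pad-on-write semantics: CHAR(n) stores the value blank-padded to length n."""
    if len(value) > n:
        raise ValueError("value too long for CHAR(%d)" % n)
    return value.ljust(n)

def char_equals(a: str, b: str) -> bool:
    """ANSI-style comparison of CHAR values ignores trailing blanks."""
    return a.rstrip(" ") == b.rstrip(" ")

stored = char_store("ab", 5)
print(repr(stored))               # 'ab   ' -- blank-padded to 5
print(char_equals(stored, "ab"))  # True   -- trailing blanks ignored
print(len(stored))                # 5      -- a "no padding" store would keep 2
```

The observable difference surfaces in `LENGTH()`-style functions and in string concatenation, which is why engines that skip padding can give different results for the same data.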

Re: FYI: The evolution on `CHAR` type behavior

2020-03-15 Thread Dongjoon Hyun
Hi, Reynold. Please see the following for the context. https://issues.apache.org/jira/browse/SPARK-31136 "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax" I raised the above issue according to the new rubric, and the banning was the proposed alternative to reduce

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Manjunath Shetty H
Only partitioned, and the join keys are not sorted, because those are written incrementally by batch jobs.

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Georg Heiler
Did you only partition, or also bucket by the join column? Are ORC indices active, i.e. are the JOIN keys sorted when writing the files? Best, Georg
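The bucketing Georg asks about can be sketched as follows — a hypothetical PySpark helper (function names, `NUM_BUCKETS`, and the column name are all illustrative, and a `SparkSession` with Hive support is assumed). When both tables are written with identical `bucketBy`/`sortBy` on the join key, Spark can plan a sort-merge join without an exchange:

```python
# Hypothetical sketch: persist both sides of the join bucketed and sorted by
# the join key so a later join can avoid the shuffle. Names are illustrative.

NUM_BUCKETS = 64  # must be identical for both tables being joined

def write_bucketed(df, table_name, key="join_key", buckets=NUM_BUCKETS):
    """Persist df as a bucketed, sorted ORC Hive table (requires Hive support)."""
    (df.write
       .format("orc")
       .bucketBy(buckets, key)       # co-locates equal keys in matching buckets
       .sortBy(key)                  # pre-sorts within each bucket
       .mode("overwrite")
       .saveAsTable(table_name))

def bucketed_join(spark, left_table, right_table, key="join_key"):
    """Join two identically bucketed tables; verify with .explain() that the
    physical plan contains no Exchange on either side."""
    return spark.table(left_table).join(spark.table(right_table), on=key)
```

Note that `bucketBy` only works with `saveAsTable` (not plain `save`), and the bucket count must match on both sides for the shuffle to be elided.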

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Manjunath Shetty H
Mostly the concern is the reshuffle. Even though all the DataFrames are partitioned by the same column, the join still does a reshuffle; that is the bottleneck as of now in our POC implementation. Is there any way to tell Spark to keep all partitions with the same partition key in the same place, so that during
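One way to confirm whether a given join actually reshuffles is to look for `Exchange` nodes in the physical plan. Here is a small sketch (helper names are made up): the pure string check is reusable on any captured plan text, and the second helper captures `df.explain()` output for a live DataFrame.

```python
# Illustrative helpers for spotting a reshuffle in a Spark physical plan.
import io
import contextlib

def plan_has_shuffle(plan_text: str) -> bool:
    """'Exchange hashpartitioning(...)' nodes in a physical plan indicate
    that data is being reshuffled across the cluster."""
    return "Exchange" in plan_text

def df_has_shuffle(df) -> bool:
    """Capture df.explain() output (df is a Spark DataFrame) and scan it."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return plan_has_shuffle(buf.getvalue())

# Against a captured plan fragment:
sample = "*(5) SortMergeJoin [k], [k], Inner\n+- Exchange hashpartitioning(k, 200)"
print(plan_has_shuffle(sample))  # True -- this join reshuffles
```

If the plan shows `Exchange` under both join inputs even though the tables were written "partitioned", that usually means Hive-style partitioning (directory layout) rather than bucketing — only bucketing carries the co-location guarantee Spark's planner can exploit.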

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread ayan guha
Hi, I would first and foremost try to identify where most of the time is spent during the query. One possibility is that it just takes ramp-up time for executors to become available; if that's the case, then maybe a dedicated YARN queue may help, or using the Spark Thrift Server may help.

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Magnus Nilsson
It's been a while, but I remember reading on Stack Overflow that you can use a UDF as a join condition to trick Catalyst into not reshuffling the partitions, i.e. use regular equality on the column you partitioned or bucketed by, and your custom comparer for the other columns. Never got around to trying it out
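The trick described above might be sketched like this (untested, and all names — `fuzzy_match`, `join_without_reshuffle`, the column names — are illustrative): plain equality on the bucketed key lets Catalyst keep its shuffle-free plan, while the comparison of the remaining columns is hidden inside an opaque UDF the planner cannot rewrite.

```python
# Sketch of the UDF-as-join-condition trick (names illustrative, untested).

def fuzzy_match(a, b) -> bool:
    """Pure comparer for the non-bucketed columns (example logic only)."""
    return (a or "").strip().lower() == (b or "").strip().lower()

def join_without_reshuffle(spark, left, right):
    """Assumes both DataFrames come from tables bucketed by 'key'.
    Requires pyspark; the import is deferred so the sketch stands alone."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    cmp_udf = F.udf(fuzzy_match, BooleanType())
    # Plain equality on the bucketed key + opaque UDF for the other columns.
    return left.join(
        right,
        (left["key"] == right["key"]) & cmp_udf(left["name"], right["name"]),
    )
```

The trade-off is that the UDF condition is evaluated row-by-row after the key match and is invisible to predicate pushdown, so this only pays off when the shuffle itself is the dominant cost.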

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Dennis Suhari
Hi, I am also using Spark on the Hive Metastore. The performance is much better, especially for larger datasets. I have the feeling that performance is better if I load the data into DataFrames and do the join there, instead of doing the join directly in Spark SQL. But I can't explain why yet. Anybody