If your data came from RDDs (i.e., not a file-system-based data source) and you don't want to cache, then no, there is not currently a way ....
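For reference, a minimal sketch of checking the size estimate Catalyst compares against the broadcast threshold, and of the DataFrame-side hint; the sqlContext, table names, and filter here are hypothetical, and the API shown is the Spark 1.5-era one:

    import org.apache.spark.sql.functions.broadcast

    // Hypothetical lookup table, filtered down before the join.
    val filtered = sqlContext.table("lookup").filter("active = 'Y'")

    // Catalyst's size estimate (in bytes) for the logical plan. An automatic
    // broadcast join is planned only when this is at or below
    // spark.sql.autoBroadcastJoinThreshold (10485760 bytes by default).
    println(filtered.queryExecution.analyzed.statistics.sizeInBytes)

    // With the DataFrame API, the hint sidesteps the size estimate entirely:
    val joined = sqlContext.table("fact").join(broadcast(filtered), "key")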
On Wed, Nov 4, 2015 at 3:51 PM, Charmee Patel <charm...@gmail.com> wrote:

> Due to other reasons we are using Spark SQL, not the DataFrame API. I saw
> that the broadcast hint is only available in the DataFrame API.
>
> On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Can you use the broadcast hint?
>>
>> e.g.
>>
>> df1.join(broadcast(df2))
>>
>> The broadcast function is in org.apache.spark.sql.functions.
>>
>> On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel <charm...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> If I have a Hive table, running ANALYZE TABLE ... COMPUTE STATISTICS will
>>> ensure Spark SQL has statistics for that table. When I have a DataFrame,
>>> is there a way to force Spark to collect statistics?
>>>
>>> I have a large lookup file, and I am trying to trigger a broadcast join
>>> by applying a filter to it beforehand. The filtered DataFrame does not
>>> have statistics, so Catalyst does not plan a broadcast join.
>>> Unfortunately I have to use Spark SQL and cannot use the DataFrame API,
>>> so I cannot give a broadcast hint in the join.
>>>
>>> For example, if the filtered DataFrame is saved as a table and COMPUTE
>>> STATISTICS is run, the statistics are:
>>>
>>> test.queryExecution.analyzed.statistics
>>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>>> Statistics(38851747)
>>>
>>> The filtered DataFrame as-is gives:
>>>
>>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>>> Statistics(58403444019505585)
>>>
>>> Forcing the filtered DataFrame to be materialized (cache + count) causes
>>> a different issue: the executors go into a deadlock-like state where not
>>> a single thread runs, for hours. I suspect caching a DataFrame and then
>>> doing a broadcast join on the same DataFrame causes this; as soon as the
>>> cache is removed, the job moves forward.
>>>
>>> Is there a way for me to force statistics collection without caching the
>>> DataFrame, so that Spark SQL would use it in a broadcast join?
>>>
>>> Thanks,
>>> Charmee
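A sketch of the save-as-table workaround described in the thread above, which gives Catalyst real statistics without caching; the hiveContext, table, and column names are hypothetical, and the noscan variant computes only the table size:

    // Persist the filtered lookup data as a table, then compute its size
    // so Catalyst gets an accurate sizeInBytes estimate from the metastore.
    val filtered = hiveContext.table("lookup").filter("active = 'Y'")
    filtered.write.saveAsTable("lookup_filtered")
    hiveContext.sql("ANALYZE TABLE lookup_filtered COMPUTE STATISTICS noscan")

    // The pure-SQL join can now be planned as a broadcast join, provided the
    // computed size falls under spark.sql.autoBroadcastJoinThreshold.
    val joined = hiveContext.sql(
      "SELECT f.*, l.description FROM fact f JOIN lookup_filtered l ON f.key = l.key")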