If your data came from RDDs (i.e., not a file-system-based data source) and you don't want to cache, then no, there is not currently a way ....
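For reference, a minimal sketch of checking the size estimate Catalyst compares against the broadcast threshold, and of the DataFrame-side hint; the sqlContext, table names, and filter here are hypothetical, and the API shown is the Spark 1.5-era one:

    import org.apache.spark.sql.functions.broadcast

    // Hypothetical lookup table, filtered down before the join.
    val filtered = sqlContext.table("lookup").filter("active = 'Y'")

    // Catalyst's size estimate (in bytes) for the logical plan. An automatic
    // broadcast join is planned only when this is at or below
    // spark.sql.autoBroadcastJoinThreshold (10485760 bytes by default).
    println(filtered.queryExecution.analyzed.statistics.sizeInBytes)

    // With the DataFrame API, the hint sidesteps the size estimate entirely:
    val joined = sqlContext.table("fact").join(broadcast(filtered), "key")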
On Wed, Nov 4, 2015 at 3:51 PM, Charmee Patel <charm...@gmail.com> wrote:

> Due to other reasons we are using Spark SQL, not the DataFrame API. I saw
> that the broadcast hint is only available in the DataFrame API.
>
> On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Can you use the broadcast hint?
>>
>> e.g.
>>
>> df1.join(broadcast(df2))
>>
>> The broadcast function is in org.apache.spark.sql.functions.
>>
>> On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel <charm...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> If I have a Hive table, running ANALYZE TABLE ... COMPUTE STATISTICS will
>>> ensure Spark SQL has statistics for that table. When I have a DataFrame,
>>> is there a way to force Spark to collect statistics?
>>>
>>> I have a large lookup file, and I am trying to trigger a broadcast join
>>> by applying a filter to it beforehand. The filtered DataFrame does not
>>> have statistics, so Catalyst does not plan a broadcast join.
>>> Unfortunately I have to use Spark SQL and cannot use the DataFrame API,
>>> so I cannot give a broadcast hint in the join.
>>>
>>> For example, if the filtered DataFrame is saved as a table and COMPUTE
>>> STATISTICS is run, the statistics are:
>>>
>>> test.queryExecution.analyzed.statistics
>>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>>> Statistics(38851747)
>>>
>>> The filtered DataFrame as-is gives:
>>>
>>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>>> Statistics(58403444019505585)
>>>
>>> Forcing the filtered DataFrame to be materialized (cache + count) causes
>>> a different issue: the executors go into a deadlock-like state where not
>>> a single thread runs, for hours. I suspect caching a DataFrame and then
>>> doing a broadcast join on the same DataFrame causes this; as soon as the
>>> cache is removed, the job moves forward.
>>>
>>> Is there a way for me to force statistics collection without caching the
>>> DataFrame, so that Spark SQL would use it in a broadcast join?
>>>
>>> Thanks,
>>> Charmee
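A sketch of the save-as-table workaround described in the thread above, which gives Catalyst real statistics without caching; the hiveContext, table, and column names are hypothetical, and the noscan variant computes only the table size:

    // Persist the filtered lookup data as a table, then compute its size
    // so Catalyst gets an accurate sizeInBytes estimate from the metastore.
    val filtered = hiveContext.table("lookup").filter("active = 'Y'")
    filtered.write.saveAsTable("lookup_filtered")
    hiveContext.sql("ANALYZE TABLE lookup_filtered COMPUTE STATISTICS noscan")

    // The pure-SQL join can now be planned as a broadcast join, provided the
    // computed size falls under spark.sql.autoBroadcastJoinThreshold.
    val joined = hiveContext.sql(
      "SELECT f.*, l.description FROM fact f JOIN lookup_filtered l ON f.key = l.key")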