Thanks for the input.  We purposefully made sure that the config option did
not make it into a release, as it is not something we are willing to
support long term.  That said, we'll try to make this easier in the future,
either through hints or better support for statistics.

In this particular case you can get what you want by registering the tables
as external tables and setting a flag.  Here's a helper function to do
what you need.

/**
 * Sugar for creating a Hive external table from a parquet path.
 */
// Assumes the members of a HiveContext (parquetFile, sql, setConf) are in
// scope, as they are in the Spark shell.
def createParquetTable(name: String, file: String): Unit = {
  import org.apache.spark.sql.hive.HiveMetastoreTypes

  // Read the Parquet file once to discover its schema, then render each
  // field as "name type" in Hive metastore syntax.
  val rdd = parquetFile(file)
  val schema = rdd.schema.fields
    .map(f => s"${f.name} ${HiveMetastoreTypes.toMetastoreType(f.dataType)}")
    .mkString(",\n")
  val ddl = s"""
    |CREATE EXTERNAL TABLE $name (
    |  $schema
    |)
    |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
    |STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    |LOCATION '$file'""".stripMargin
  sql(ddl)
  // Tell Spark SQL to read metastore Parquet tables with its native
  // Parquet support instead of going through the Hive SerDe.
  setConf("spark.sql.hive.convertMetastoreParquet", "true")
}

You'll also need to run this to populate the statistics:

ANALYZE TABLE tableName COMPUTE STATISTICS NOSCAN;
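
For example, assuming a small dimension table already written out as
Parquet (the table name, path, and columns below are made up for
illustration), the whole flow from the shell would be roughly:

// Hypothetical names and path, for illustration only.
createParquetTable("dim_country", "hdfs:///warehouse/dim_country.parquet")

// NOSCAN fills in the table-size statistic without scanning the data.
sql("ANALYZE TABLE dim_country COMPUTE STATISTICS NOSCAN")

// With the statistic in place, joins against tables smaller than
// spark.sql.autoBroadcastJoinThreshold should be planned as broadcast
// joins automatically.
sql("SELECT f.*, d.name FROM fact_events f JOIN dim_country d ON f.country_id = d.id")

If your dimension tables are larger than the threshold, you can raise it,
e.g. setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString).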


On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang <jianshi.hu...@gmail.com>
wrote:

> Ok, currently there's cost-based optimization; however, Parquet statistics
> are not implemented...
>
> What's a good way to join a big fact table with several tiny dimension
> tables in Spark SQL (1.1)?
>
> I wish we could allow user hints for the join.
>
> Jianshi
>
> On Wed, Oct 8, 2014 at 2:18 PM, Jianshi Huang <jianshi.hu...@gmail.com>
> wrote:
>
>> Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not
>> merged into master?
>>
>> I cannot find spark.sql.hints.broadcastTables in the latest master, but
>> it's in the following patch.
>>
>>
>> https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5
>>
>>
>> Jianshi
>>
>>
>> On Mon, Sep 29, 2014 at 1:24 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>> wrote:
>>
>>> Yes, looks like it can only be controlled by the
>>> parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit
>>> weird to me.
>>>
>>> How am I supposed to know the exact size of a table in bytes? Letting me
>>> specify the join algorithm would be preferable, I think.
>>>
>>> Jianshi
>>>
>>> On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Have you looked at SPARK-1800 ?
>>>>
>>>> e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
>>>> Cheers
>>>>
>>>> On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang <jianshi.hu...@gmail.com
>>>> > wrote:
>>>>
>>>>> I cannot find it in the documentation. And I have a dozen dimension
>>>>> tables to (left) join...
>>>>>
>>>>>
>>>>> Cheers,
>>>>> --
>>>>> Jianshi Huang
>>>>>
>>>>> LinkedIn: jianshi
>>>>> Twitter: @jshuang
>>>>> Github & Blog: http://huangjs.github.com/
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
