It works fine. Thanks for the help, Michael.

Liancheng also told me a trick: use a subquery with LIMIT n. It works in
the latest 1.2.0.
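
For reference, the trick looks roughly like this (table and column names are
made up). As I understand it, the LIMIT gives the planner a bounded size
estimate for the dimension table, so it qualifies for a broadcast join; n just
has to be at least the dimension table's row count:

  val joined = sql("""
    SELECT f.*, d.name
    FROM fact f
    JOIN (SELECT * FROM dim LIMIT 100) d ON f.dim_id = d.id
    """)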

BTW, it looks like the broadcast optimization won't be applied if I do a
left join instead of an inner join. Is that true? How can I make it work for
left joins?
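
Concretely, here's the difference I'm seeing (hypothetical table names):

  // planned as a broadcast join
  sql("SELECT * FROM fact f JOIN small_dim d ON f.k = d.k")

  // the broadcast optimization is not applied; falls back to a shuffled join
  sql("SELECT * FROM fact f LEFT JOIN small_dim d ON f.k = d.k")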

Cheers,
Jianshi

On Thu, Oct 9, 2014 at 3:10 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> Thanks for the input. We purposefully made sure that the config option
> did not make it into a release, as it is not something that we are willing
> to support long term. That said, we'll try to make this easier in the
> future, either through hints or better support for statistics.
>
> In this particular case you can get what you want by registering the
> tables as external tables and setting a flag. Here's a helper function to
> do what you need.
>
> /**
>  * Sugar for creating a Hive external table from a parquet path.
>  */
> def createParquetTable(name: String, file: String): Unit = {
>   import org.apache.spark.sql.hive.HiveMetastoreTypes
>
>   // Load the Parquet file once just to read its schema.
>   val rdd = parquetFile(file)
>
>   // Render each field as a "name type" column definition, translating
>   // Spark SQL types into Hive metastore types.
>   val schema = rdd.schema.fields
>     .map(f => s"${f.name} ${HiveMetastoreTypes.toMetastoreType(f.dataType)}")
>     .mkString(",\n")
>
>   val ddl = s"""
>     |CREATE EXTERNAL TABLE $name (
>     |  $schema
>     |)
>     |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
>     |STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
>     |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
>     |LOCATION '$file'""".stripMargin
>   sql(ddl)
>
>   // Have Spark SQL scan the metastore Parquet table with its native
>   // Parquet support instead of going through the Hive SerDe.
>   setConf("spark.sql.hive.convertMetastoreParquet", "true")
> }
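>
> For example (the table name and path are just placeholders):
>
>   createParquetTable("dim_users", "/path/to/dim_users.parquet")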
>
> You'll also need to run this to populate the statistics:
>
> ANALYZE TABLE tableName COMPUTE STATISTICS noscan;
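>
> From the shell you can issue that through sql(); e.g. (again, the table
> name is a placeholder):
>
>   sql("ANALYZE TABLE dim_users COMPUTE STATISTICS noscan")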
>
>
> On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang <jianshi.hu...@gmail.com>
> wrote:
>
>> OK, so currently there is cost-based optimization; however, Parquet
>> statistics are not implemented...
>>
>> What's a good way to join a big fact table with several tiny dimension
>> tables in Spark SQL (1.1)?
>>
>> I wish we could allow user hints for the join.
>>
>> Jianshi
>>
>> On Wed, Oct 8, 2014 at 2:18 PM, Jianshi Huang <jianshi.hu...@gmail.com>
>> wrote:
>>
>>> Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not
>>> merged into master?
>>>
>>> I cannot find spark.sql.hints.broadcastTables in the latest master, but
>>> it's in the following patch.
>>>
>>>
>>> https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5
>>>
>>>
>>> Jianshi
>>>
>>>
>>> On Mon, Sep 29, 2014 at 1:24 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>>> wrote:
>>>
>>>> Yes, looks like it can only be controlled by the
>>>> parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird
>>>> to me.
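>>>>
>>>> For the record, here's how I'm bumping it for now (the value is in bytes
>>>> and the number below is arbitrary):
>>>>
>>>>   setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)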
>>>>
>>>> How am I supposed to know the exact size of a table in bytes? Letting me
>>>> specify the preferred join algorithm would be better, I think.
>>>>
>>>> Jianshi
>>>>
>>>> On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> Have you looked at SPARK-1800?
>>>>>
>>>>> e.g., see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang <
>>>>> jianshi.hu...@gmail.com> wrote:
>>>>>
>>>>>> I cannot find it in the documentation. And I have a dozen dimension
>>>>>> tables to (left) join...
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> --
>>>>>> Jianshi Huang
>>>>>>
>>>>>> LinkedIn: jianshi
>>>>>> Twitter: @jshuang
>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jianshi Huang
>>>>
>>>> LinkedIn: jianshi
>>>> Twitter: @jshuang
>>>> Github & Blog: http://huangjs.github.com/
>>>>
>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
