spark.sql.parquet.filterPushdown defaults to false because of a bug in Parquet that may cause an NPE. Please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration

This bug hasn’t been fixed in Parquet master. We’ll turn this on once the bug is fixed.
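
For reference, once that fix lands the flag can be enabled per session from the shell. A minimal sketch, assuming a HiveContext bound to the usual `sqlContext` variable in spark-shell:

```scala
// Session-level setting; off by default until the Parquet NPE bug is fixed.
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")

// With pushdown enabled, a predicate like this can be evaluated inside the
// Parquet scan instead of after materializing every row; explain() shows
// whether the filter reached the scan node.
sqlContext.sql("SELECT id, msg FROM test WHERE id = '42'").explain()
```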

Cheng

On 1/19/15 5:02 PM, Xiaoyu Wang wrote:

The spark.sql.parquet.filterPushdown=true setting has been turned on. But spark.sql.hive.convertMetastoreParquet is set to false, so the first parameter has no effect!

2015-01-20 6:52 GMT+08:00 Yana Kadiyska <yana.kadiy...@gmail.com <mailto:yana.kadiy...@gmail.com>>:

    If you're talking about filter pushdowns for Parquet files, this
    also has to be turned on explicitly. Try
    spark.sql.parquet.filterPushdown=true. It's off by default.

    On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang <wangxy...@gmail.com
    <mailto:wangxy...@gmail.com>> wrote:

        Yes, it works!
        But the filter can't be pushed down!

        Should a custom ParquetInputFormat implement the data source API instead?

        
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
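
For context, the interface in that file that receives pushed-down filters is PrunedFilteredScan. A rough, non-working sketch of its shape under the 1.2-era sources API (readMyParquet is a hypothetical reader standing in for real Parquet-reading code, and the exact package paths may differ across versions):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}

// Sketch only: a relation that is handed the pruned column list and the
// pushed-down filters by the planner. Pushdown happens through buildScan
// regardless of spark.sql.parquet.filterPushdown, which only governs the
// built-in ParquetTableScan path.
class MyParquetRelation(path: String, val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema = ???  // e.g. derive from the Parquet file footer

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] =
    readMyParquet(path, requiredColumns, filters)  // hypothetical helper
}
```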

        2015-01-16 21:51 GMT+08:00 Xiaoyu Wang <wangxy...@gmail.com
        <mailto:wangxy...@gmail.com>>:

            Thanks yana!
            I will try it!

            On Jan 16, 2015, at 20:51, yana <yana.kadiy...@gmail.com
            <mailto:yana.kadiy...@gmail.com>> wrote:

            I think you might need to set
            spark.sql.hive.convertMetastoreParquet to false if I
            understand that flag correctly
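
A minimal way to try that from the shell, assuming the usual sqlContext variable (this sets it for the session only; it can also go in your Spark SQL defaults):

```scala
// Disable Spark SQL's automatic conversion of metastore Parquet tables,
// so the query falls back to HiveTableScan and honors the table's
// custom SerDe / InputFormat instead of the native ParquetTableScan.
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
```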



            -------- Original message --------
            From: Xiaoyu Wang
            Date:01/16/2015 5:09 AM (GMT-05:00)
            To: user@spark.apache.org <mailto:user@spark.apache.org>
            Subject: Why custom parquet format hive table execute
            "ParquetTableScan" physical plan, not "HiveTableScan"?

            Hi all!

            In Spark SQL 1.2.0, I created a Hive table with a custom
            Parquet InputFormat and OutputFormat, like this:
            CREATE TABLE test(
              id string,
              msg string)
            CLUSTERED BY (
              id)
            SORTED BY (
              id ASC)
            INTO 10 BUCKETS
            ROW FORMAT SERDE
              'com.a.MyParquetHiveSerDe'
            STORED AS INPUTFORMAT
              'com.a.MyParquetInputFormat'
            OUTPUTFORMAT
              'com.a.MyParquetOutputFormat';

            And in the Spark shell, the plan of "select * from test" is:

            == Physical Plan ==
            !OutputFaker [id#5,msg#6]
             ParquetTableScan [id#12,msg#13], (ParquetRelation
            hdfs://hadoop/user/hive/warehouse/test.db/test,
            Some(Configuration: core-default.xml, core-site.xml,
            mapred-default.xml, mapred-site.xml, yarn-default.xml,
            yarn-site.xml, hdfs-default.xml, hdfs-site.xml),
            org.apache.spark.sql.hive.HiveContext@6d15a113, []), []

            Not HiveTableScan!
            So it doesn't execute my custom InputFormat!
            Why? How can I get it to use my custom InputFormat?

            Thanks!



