Hm, you might want to ask on the dev list if you don't get a good answer here. I'm also trying to decipher this part of the code, as I'm having issues with predicate pushdown. I can see (in the master branch) that the SQL codepath (which is taken if you don't convert the metastore) pushes the Parquet filters into a Hadoop Configuration object, in C:\spark-master\sql\core\src\main\scala\org\apache\spark\sql\parquet\ParquetTableOperations.scala around line 107. Spark 1.2 has similar code in the same file, via the method ParquetInputFormat.setFilterPredicate. But I think that in the case where you go through HiveTableScan, you'd go through C:\spark-master\sql\hive\src\main\scala\org\apache\spark\sql\hive\TableReader.scala, and I don't see anything happening with the filters there. But I'm not a dev on this project -- mostly I'm really interested in the answer. Please do update if you figure this out!
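For reference, here is roughly what that codepath boils down to -- a minimal sketch using the parquet-hadoop API that Spark 1.2 ships with (the column name and value below are made up for illustration; this is not Spark's actual code):

```scala
import org.apache.hadoop.conf.Configuration
import parquet.filter2.predicate.FilterApi
import parquet.hadoop.ParquetInputFormat
import parquet.io.api.Binary

// Build a predicate equivalent to "WHERE id = 'foo'" and stash it in the
// Hadoop Configuration; the Parquet record reader then picks it up at scan
// time and can skip row groups whose statistics rule the value out.
val conf = new Configuration()
val predicate = FilterApi.eq(FilterApi.binaryColumn("id"), Binary.fromString("foo"))
ParquetInputFormat.setFilterPredicate(conf, predicate)
```

As far as I can tell, nothing equivalent happens on the HiveTableScan path, which would explain why the filters aren't pushed down there.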
On Mon, Jan 19, 2015 at 8:02 PM, Xiaoyu Wang <wangxy...@gmail.com> wrote:
> The spark.sql.parquet.filterPushdown=true has been turned on. But with
> spark.sql.hive.convertMetastoreParquet set to false, the first parameter
> no longer takes effect!!!
>
> 2015-01-20 6:52 GMT+08:00 Yana Kadiyska <yana.kadiy...@gmail.com>:
>
>> If you're talking about filter pushdown for parquet files, this also has
>> to be turned on explicitly. Try spark.sql.parquet.filterPushdown=true.
>> It's off by default.
>>
>> On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang <wangxy...@gmail.com> wrote:
>>
>>> Yes, it works!
>>> But the filter can't be pushed down!!!
>>>
>>> Does the custom parquet inputformat only get filter pushdown if it
>>> implements the datasource API?
>>>
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>>>
>>> 2015-01-16 21:51 GMT+08:00 Xiaoyu Wang <wangxy...@gmail.com>:
>>>
>>>> Thanks, yana!
>>>> I will try it!
>>>>
>>>> On Jan 16, 2015, at 20:51, yana <yana.kadiy...@gmail.com> wrote:
>>>>
>>>> I think you might need to set
>>>> spark.sql.hive.convertMetastoreParquet to false, if I understand that
>>>> flag correctly.
>>>>
>>>> Sent on the new Sprint Network from my Samsung Galaxy S®4.
>>>>
>>>>
>>>> -------- Original message --------
>>>> From: Xiaoyu Wang
>>>> Date: 01/16/2015 5:09 AM (GMT-05:00)
>>>> To: user@spark.apache.org
>>>> Subject: Why does a custom parquet format hive table execute the
>>>> "ParquetTableScan" physical plan, not "HiveTableScan"?
>>>>
>>>> Hi all!
>>>>
>>>> In Spark SQL 1.2.0,
>>>> I create a hive table with a custom parquet inputformat and outputformat,
>>>> like this:
>>>>
>>>> CREATE TABLE test(
>>>>   id string,
>>>>   msg string)
>>>> CLUSTERED BY (
>>>>   id)
>>>> SORTED BY (
>>>>   id ASC)
>>>> INTO 10 BUCKETS
>>>> ROW FORMAT SERDE
>>>>   'com.a.MyParquetHiveSerDe'
>>>> STORED AS INPUTFORMAT
>>>>   'com.a.MyParquetInputFormat'
>>>> OUTPUTFORMAT
>>>>   'com.a.MyParquetOutputFormat';
>>>>
>>>> And in the spark shell, the plan of "select * from test" is:
>>>>
>>>> [== Physical Plan ==]
>>>> [!OutputFaker [id#5,msg#6]]
>>>> [ ParquetTableScan [id#12,msg#13], (ParquetRelation
>>>> hdfs://hadoop/user/hive/warehouse/test.db/test, Some(Configuration:
>>>> core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
>>>> yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml),
>>>> org.apache.spark.sql.hive.HiveContext@6d15a113, []), []]
>>>>
>>>> Not HiveTableScan!!!
>>>> So it doesn't execute my custom inputformat!
>>>> Why? How can I get it to execute my custom inputformat?
>>>>
>>>> Thanks!
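Regarding the datasource API question in the quoted thread: the interfaces.scala file linked above defines a PrunedFilteredScan trait, which is the hook through which Spark SQL hands column pruning and translated filters to an external relation. A minimal sketch of what implementing it might look like -- the class name, schema, and empty scan body are hypothetical placeholders, not a working Parquet reader:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation for the "test" table above. Spark passes buildScan
// the columns it needs and the filters it could translate, so the data
// source can do its own pruning and predicate pushdown.
class MyParquetRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", StringType), StructField("msg", StringType)))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // A real implementation would translate `filters` into a Parquet
    // FilterPredicate and hand it to the custom input format via
    // ParquetInputFormat.setFilterPredicate before reading.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
```

So a table registered through this API would get filters delivered to it, whereas a Hive metastore table scanned via HiveTableScan apparently does not.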