[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960296#comment-14960296
 ] 

patcharee commented on SPARK-11087:
-----------------------------------

[~zhazhan]

Below is my test; please check. I also tried changing 
"hive.exec.orc.split.strategy", but none of the settings produced "OrcInputFormat 
[INFO] ORC pushdown predicate" in the log, as shown in your result.

2508  case class Contact(name: String, phone: String)
2509  case class Person(name: String, age: Int, contacts: Seq[Contact])
2510  val records = (1 to 100).map { i => Person(s"name_$i", i, (0 to 1).map { 
m => Contact(s"contact_$m", s"phone_$m") } )
2511  }
2512  sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
2513  
sc.parallelize(records).toDF().write.format("orc").partitionBy("age").save("peoplePartitioned")
2514  val peoplePartitioned = 
sqlContext.read.format("orc").load("peoplePartitioned")
2515   peoplePartitioned.registerTempTable("peoplePartitioned")

scala> sqlContext.setConf("hive.exec.orc.split.strategy", "ETL")
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL

scala>  sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name 
= 'name_20'").count
15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:10:53 INFO PerfLogger: <PERFLOG method=OrcGetSplits 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:10:53 INFO PerfLogger: <PERFLOG method=OrcGetSplits 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:10:53 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:10:53 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:10:53 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:10:53 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:10:53 INFO PerfLogger: </PERFLOG method=OrcGetSplits 
start=1444979453032 end=1444979453038 duration=6 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:10:53 INFO PerfLogger: </PERFLOG method=OrcGetSplits 
start=1444979453032 end=1444979453038 duration=6 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
res5: Long = 1

scala> sqlContext.setConf("hive.exec.orc.split.strategy", "BI")
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI

scala>  sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name 
= 'name_20'").count
15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:19 INFO PerfLogger: <PERFLOG method=OrcGetSplits 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:11:19 INFO PerfLogger: <PERFLOG method=OrcGetSplits 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:19 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:19 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:11:19 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:11:19 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:11:19 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:11:19 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:11:19 INFO PerfLogger: </PERFLOG method=OrcGetSplits 
start=1444979479831 end=1444979479846 duration=15 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:11:19 INFO PerfLogger: </PERFLOG method=OrcGetSplits 
start=1444979479831 end=1444979479846 duration=15 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
res7: Long = 1

scala> sqlContext.setConf("hive.exec.orc.split.strategy", "HYBRID")
15/10/16 09:11:27 DEBUG VariableSubstitution: Substitution is on: HYBRID
15/10/16 09:11:27 DEBUG VariableSubstitution: Substitution is on: HYBRID
15/10/16 09:11:27 DEBUG VariableSubstitution: Substitution is on: HYBRID
15/10/16 09:11:27 DEBUG VariableSubstitution: Substitution is on: HYBRID

scala>  sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name 
= 'name_20'").count
15/10/16 09:11:29 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:29 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:29 INFO PerfLogger: <PERFLOG method=OrcGetSplits 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:11:29 INFO PerfLogger: <PERFLOG method=OrcGetSplits 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:11:29 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:29 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:29 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:29 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:29 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:11:29 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:11:29 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:11:29 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:11:29 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:11:29 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:11:29 INFO PerfLogger: </PERFLOG method=OrcGetSplits 
start=1444979489785 end=1444979489789 duration=4 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:11:29 INFO PerfLogger: </PERFLOG method=OrcGetSplits 
start=1444979489785 end=1444979489789 duration=4 
from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
res9: Long = 1



> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> ---------------------------------------------------------------------
>
>                 Key: SPARK-11087
>                 URL: https://issues.apache.org/jira/browse/SPARK-11087
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>         Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>            Reporter: patcharee
>            Priority: Minor
>
> I have an external hive table stored as a partitioned orc file (see the table 
> schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But with debug logging enabled, the log file shows that the ORC pushdown 
> predicate was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I still 
> expected the ORC pushdown predicate to be generated (because of the where 
> clause).
> Table schema
> ================================
> hive> describe formatted 4D;
> OK
> # col_name                    data_type               comment             
>                
> date                  int                                         
> hh                    int                                         
> x                     int                                         
> y                     int                                         
> height                float                                       
> u                     float                                       
> v                     float                                       
> w                     float                                       
> ph                    float                                       
> phb                   float                                       
> t                     float                                       
> p                     float                                       
> pb                    float                                       
> qvapor                float                                       
> qgraup                float                                       
> qnice                 float                                       
> qnrain                float                                       
> tke_pbl               float                                       
> el_pbl                float                                       
> qcloud                float                                       
>                
> # Partition Information                
> # col_name                    data_type               comment             
>                
> zone                  int                                         
> z                     int                                         
> year                  int                                         
> month                 int                                         
>                
> # Detailed Table Information           
> Database:             default                  
> Owner:                patcharee                
> CreateTime:           Thu Jul 09 16:46:54 CEST 2015    
> LastAccessTime:       UNKNOWN                  
> Protect Mode:         None                     
> Retention:            0                        
> Location:             hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D       
>  
> Table Type:           EXTERNAL_TABLE           
> Table Parameters:              
>       EXTERNAL                TRUE                
>       comment                 this table is imported from rwf_data/*/wrf/*
>       last_modified_by        patcharee           
>       last_modified_time      1439806692          
>       orc.compress            ZLIB                
>       transient_lastDdlTime   1439806692          
>                
> # Storage Information          
> SerDe Library:        org.apache.hadoop.hive.ql.io.orc.OrcSerde        
> InputFormat:          org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
> OutputFormat:         org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat        
>  
> Compressed:           No                       
> Num Buckets:          -1                       
> Bucket Columns:       []                       
> Sort Columns:         []                       
> Storage Desc Params:           
>       serialization.format    1                   
> Time taken: 0.388 seconds, Fetched: 58 row(s)
> ================================
> Data was inserted into this table by another spark job>
> df.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("4D")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to