[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960296#comment-14960296 ]
patcharee commented on SPARK-11087:
-----------------------------------

[~zhazhan] Below is my test; please check. I also tried changing "hive.exec.orc.split.strategy", but none of the values produced the "OrcInputFormat [INFO] ORC pushdown predicate" line that your result shows.

{code}
case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])
val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sc.parallelize(records).toDF().write.format("orc").partitionBy("age").save("peoplePartitioned")
val peoplePartitioned = sqlContext.read.format("orc").load("peoplePartitioned")
peoplePartitioned.registerTempTable("peoplePartitioned")
{code}

Split strategy "ETL":

{code}
scala> sqlContext.setConf("hive.exec.orc.split.strategy", "ETL")
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL

scala> sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'").count
15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:10:53 INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0
15/10/16 09:10:53 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0
15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:10:53 DEBUG OrcInputFormat: hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551 projected_columns_uncompressed_size: -1
15/10/16 09:10:53 INFO PerfLogger: </PERFLOG method=OrcGetSplits start=1444979453032 end=1444979453038 duration=6 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
res5: Long = 1
{code}

Split strategy "BI":

{code}
scala> sqlContext.setConf("hive.exec.orc.split.strategy", "BI")
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI

scala> sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'").count
15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:19 INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0
15/10/16 09:11:19 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0
15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:11:19 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:11:19 DEBUG OrcInputFormat: hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551 projected_columns_uncompressed_size: -1
15/10/16 09:11:19 INFO PerfLogger: </PERFLOG method=OrcGetSplits start=1444979479831 end=1444979479846 duration=15 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
res7: Long = 1
{code}

Split strategy "HYBRID":

{code}
scala> sqlContext.setConf("hive.exec.orc.split.strategy", "HYBRID")
15/10/16 09:11:27 DEBUG VariableSubstitution: Substitution is on: HYBRID

scala> sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'").count
15/10/16 09:11:29 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:29 INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/10/16 09:11:29 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0
15/10/16 09:11:29 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0
15/10/16 09:11:29 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:11:29 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:11:29 DEBUG OrcInputFormat: hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551 projected_columns_uncompressed_size: -1
15/10/16 09:11:29 INFO PerfLogger: </PERFLOG method=OrcGetSplits start=1444979489785 end=1444979489789 duration=4 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
res9: Long = 1
{code}

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> ---------------------------------------------------------------------
>
>                 Key: SPARK-11087
>                 URL: https://issues.apache.org/jira/browse/SPARK-11087
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>        Environment: orc file version 0.12 with HIVE_8732
>                     hive version 1.2.1.2.3.0.0-2557
>            Reporter: patcharee
>            Priority: Minor
>
> I have an external Hive table stored as a partitioned ORC file (see the table schema below). I tried to query the table with a where clause:
>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 117")
>
> But from the log file, with debug logging level on, the ORC pushdown predicate was not generated.
> Unfortunately my table was not sorted when I inserted the data, but I expected the ORC pushdown predicate to be generated anyway (because of the where clause).
>
> Table schema
> ================================
> hive> describe formatted 4D;
> OK
> # col_name              data_type       comment
>
> date                    int
> hh                      int
> x                       int
> y                       int
> height                  float
> u                       float
> v                       float
> w                       float
> ph                      float
> phb                     float
> t                       float
> p                       float
> pb                      float
> qvapor                  float
> qgraup                  float
> qnice                   float
> qnrain                  float
> tke_pbl                 float
> el_pbl                  float
> qcloud                  float
>
> # Partition Information
> # col_name              data_type       comment
>
> zone                    int
> z                       int
> year                    int
> month                   int
>
> # Detailed Table Information
> Database:               default
> Owner:                  patcharee
> CreateTime:             Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:         UNKNOWN
> Protect Mode:           None
> Retention:              0
> Location:               hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D
> Table Type:             EXTERNAL_TABLE
> Table Parameters:
>       EXTERNAL                TRUE
>       comment                 this table is imported from rwf_data/*/wrf/*
>       last_modified_by        patcharee
>       last_modified_time      1439806692
>       orc.compress            ZLIB
>       transient_lastDdlTime   1439806692
>
> # Storage Information
> SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:             No
> Num Buckets:            -1
> Bucket Columns:         []
> Sort Columns:           []
> Storage Desc Params:
>       serialization.format    1
> Time taken: 0.388 seconds, Fetched: 58 row(s)
> ================================
>
> Data was inserted into this table by another Spark job:
>
> df.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("4D")

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
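
For reference, the count of 1 reported in each run above (res5/res7/res9) follows from the generated data alone. This is a minimal plain-Scala sketch, not part of the original test and with no Spark involved, that replays the same predicate (age = 20 AND name = 'name_20') over the same generated records; the object name `PredicateCheck` is made up for illustration.

```scala
// Cross-check of the expected query result, without Spark:
// the generated dataset contains exactly one Person with age 20,
// and that Person's name is "name_20", so the predicate matches one row.
case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

object PredicateCheck {
  // Same record generation as in the test above.
  val records: Seq[Person] = (1 to 100).map { i =>
    Person(s"name_$i", i, (0 to 1).map(m => Contact(s"contact_$m", s"phone_$m")))
  }

  // Equivalent of: SELECT * FROM peoplePartitioned WHERE age = 20 AND name = 'name_20'
  def matching: Seq[Person] =
    records.filter(p => p.age == 20 && p.name == "name_20")

  def main(args: Array[String]): Unit =
    println(matching.size) // prints 1, matching the counts in the transcripts
}
```

This only confirms what the correct answer is; whether ORC pushdown was used to get it is exactly what the DEBUG logs above fail to show.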