Can you paste the actual query plan here, please?
> On Jan 17, 2017, at 7:38 PM, Raju Bairishetti <r...@apache.org> wrote: > > > On Wed, Jan 18, 2017 at 11:13 AM, Michael Allman <mich...@videoamp.com > <mailto:mich...@videoamp.com>> wrote: > What is the physical query plan after you set > spark.sql.hive.convertMetastoreParquet to true? > Physical plan continas all the partition locations > > Michael > >> On Jan 17, 2017, at 6:51 PM, Raju Bairishetti <r...@apache.org >> <mailto:r...@apache.org>> wrote: >> >> Thanks Michael for the respopnse. >> >> >> On Wed, Jan 18, 2017 at 2:45 AM, Michael Allman <mich...@videoamp.com >> <mailto:mich...@videoamp.com>> wrote: >> Hi Raju, >> >> I'm sorry this isn't working for you. I helped author this functionality and >> will try my best to help. >> >> First, I'm curious why you set spark.sql.hive.convertMetastoreParquet to >> false? >> I had set as suggested in SPARK-6910 and corresponsing pull reqs. It did not >> work for me without setting spark.sql.hive.convertMetastoreParquet >> property. >> >> Can you link specifically to the jira issue or spark pr you referred to? The >> first thing I would try is setting spark.sql.hive.convertMetastoreParquet to >> true. Setting that to false might also explain why you're getting parquet >> decode errors. If you're writing your table data with Spark's parquet file >> writer and reading with Hive's parquet file reader, there may be an >> incompatibility accounting for the decode errors you're seeing. >> >> https://issues.apache.org/jira/browse/SPARK-6910 >> <https://issues.apache.org/jira/browse/SPARK-6910> . My main motivation is >> to avoid fetching all the partitions. We reverted >> spark.sql.hive.convertMetastoreParquet setting to true to decoding errors. >> After reverting this it is fetching all partiitons from the table. >> >> Can you reply with your table's Hive metastore schema, including partition >> schema? >> col1 string >> col2 string >> year int >> month int >> day int >> hour int >> # Partition Information >> >> # col_name data_type comment >> >> year int >> >> month int >> >> day int >> >> hour int >> >> venture string >> >> >> Where are the table's files located? >> In hadoop. Under some user directory. >> If you do a "show partitions <dbname>.<tablename>" in the spark-sql shell, >> does it show the partitions you expect to see? If not, run "msck repair >> table <dbname>.<tablename>". >> Yes. It is listing the partitions >> Cheers, >> >> Michael >> >> >>> On Jan 17, 2017, at 12:02 AM, Raju Bairishetti <r...@apache.org >>> <mailto:r...@apache.org>> wrote: >>> >>> Had a high level look into the code. Seems getHiveQlPartitions method from >>> HiveMetastoreCatalog is getting called irrespective of >>> metastorePartitionPruning conf value. >>> >>> It should not fetch all partitions if we set metastorePartitionPruning to >>> true (Default value for this is false) >>> def getHiveQlPartitions(predicates: Seq[Expression] = Nil): Seq[Partition] >>> = { >>> val rawPartitions = if (sqlContext.conf.metastorePartitionPruning) { >>> table.getPartitions(predicates) >>> } else { >>> allPartitions >>> } >>> ... >>> def getPartitions(predicates: Seq[Expression]): Seq[HivePartition] = >>> client.getPartitionsByFilter(this, predicates) >>> lazy val allPartitions = table.getAllPartitions >>> But somehow getAllPartitions is getting called eventough after setting >>> metastorePartitionPruning to true. >>> Am I missing something or looking at wrong place? >>> >>> On Tue, Jan 17, 2017 at 4:01 PM, Raju Bairishetti <r...@apache.org >>> <mailto:r...@apache.org>> wrote: >>> Hello, >>> >>> Spark sql is generating query plan with all partitions information even >>> though if we apply filters on partitions in the query. Due to this, >>> sparkdriver/hive metastore is hitting with OOM as each table is with lots >>> of partitions. >>> >>> We can confirm from hive audit logs that it tries to fetch all partitions >>> from hive metastore. >>> >>> 2016-12-28 07:18:33,749 INFO [pool-4-thread-184]: HiveMetaStore.audit >>> (HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub ip=/x.x.x.x >>> cmd=get_partitions : db=xxxx tbl=xxxxx >>> >>> >>> Configured the following parameters in the spark conf to fix the above >>> issue(source: from spark-jira & github pullreq): >>> spark.sql.hive.convertMetastoreParquet false >>> spark.sql.hive.metastorePartitionPruning true >>> >>> plan: rdf.explain >>> == Physical Plan == >>> HiveTableScan [rejection_reason#626], MetastoreRelation dbname, >>> tablename, None, [(year#314 = 2016),(month#315 = 12),(day#316 = >>> 28),(hour#317 = 2),(venture#318 = DEFAULT)] >>> >>> get_partitions_by_filter method is called and fetching only required >>> partitions. >>> >>> But we are seeing parquetDecode errors in our applications frequently >>> after this. Looks like these decoding errors were because of changing serde >>> fromspark-builtin to hive serde. >>> >>> I feel like, fixing query plan generation in the spark-sql is the right >>> approach instead of forcing users to use hive serde. >>> >>> Is there any workaround/way to fix this issue? I would like to hear more >>> thoughts on this :) >>> >>> >>> On Tue, Jan 17, 2017 at 4:00 PM, Raju Bairishetti <r...@apache.org >>> <mailto:r...@apache.org>> wrote: >>> Had a high level look into the code. Seems getHiveQlPartitions method from >>> HiveMetastoreCatalog is getting called irrespective of >>> metastorePartitionPruning conf value. >>> >>> It should not fetch all partitions if we set metastorePartitionPruning to >>> true (Default value for this is false) >>> def getHiveQlPartitions(predicates: Seq[Expression] = Nil): Seq[Partition] >>> = { >>> val rawPartitions = if (sqlContext.conf.metastorePartitionPruning) { >>> table.getPartitions(predicates) >>> } else { >>> allPartitions >>> } >>> ... >>> def getPartitions(predicates: Seq[Expression]): Seq[HivePartition] = >>> client.getPartitionsByFilter(this, predicates) >>> lazy val allPartitions = table.getAllPartitions >>> But somehow getAllPartitions is getting called eventough after setting >>> metastorePartitionPruning to true. >>> Am I missing something or looking at wrong place? >>> >>> On Mon, Jan 16, 2017 at 12:53 PM, Raju Bairishetti <r...@apache.org >>> <mailto:r...@apache.org>> wrote: >>> Waiting for suggestions/help on this... >>> >>> On Wed, Jan 11, 2017 at 12:14 PM, Raju Bairishetti <r...@apache.org >>> <mailto:r...@apache.org>> wrote: >>> Hello, >>> >>> Spark sql is generating query plan with all partitions information even >>> though if we apply filters on partitions in the query. Due to this, spark >>> driver/hive metastore is hitting with OOM as each table is with lots of >>> partitions. >>> >>> We can confirm from hive audit logs that it tries to fetch all partitions >>> from hive metastore. >>> >>> 2016-12-28 07:18:33,749 INFO [pool-4-thread-184]: HiveMetaStore.audit >>> (HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub ip=/x.x.x.x >>> cmd=get_partitions : db=xxxx tbl=xxxxx >>> >>> >>> Configured the following parameters in the spark conf to fix the above >>> issue(source: from spark-jira & github pullreq): >>> spark.sql.hive.convertMetastoreParquet false >>> spark.sql.hive.metastorePartitionPruning true >>> >>> plan: rdf.explain >>> == Physical Plan == >>> HiveTableScan [rejection_reason#626], MetastoreRelation dbname, >>> tablename, None, [(year#314 = 2016),(month#315 = 12),(day#316 = >>> 28),(hour#317 = 2),(venture#318 = DEFAULT)] >>> >>> get_partitions_by_filter method is called and fetching only required >>> partitions. >>> >>> But we are seeing parquetDecode errors in our applications frequently >>> after this. Looks like these decoding errors were because of changing serde >>> from spark-builtin to hive serde. >>> >>> I feel like, fixing query plan generation in the spark-sql is the right >>> approach instead of forcing users to use hive serde. >>> >>> Is there any workaround/way to fix this issue? I would like to hear more >>> thoughts on this :) >>> >>> ------ >>> Thanks, >>> Raju Bairishetti, >>> www.lazada.com <http://www.lazada.com/> >>> >>> >>> -- >>> >>> ------ >>> Thanks, >>> Raju Bairishetti, >>> www.lazada.com <http://www.lazada.com/> >>> >>> >>> -- >>> >>> ------ >>> Thanks, >>> Raju Bairishetti, >>> www.lazada.com <http://www.lazada.com/> >>> >>> >>> -- >>> >>> ------ >>> Thanks, >>> Raju Bairishetti, >>> www.lazada.com <http://www.lazada.com/> >>> >>> >>> -- >>> >>> ------ >>> Thanks, >>> Raju Bairishetti, >>> www.lazada.com <http://www.lazada.com/> >> >> >> >> -- >> >> ------ >> Thanks, >> Raju Bairishetti, >> www.lazada.com <http://www.lazada.com/> > > > > -- > > ------ > Thanks, > Raju Bairishetti, > www.lazada.com <http://www.lazada.com/>