On Wed, Jan 18, 2017 at 11:13 AM, Michael Allman <mich...@videoamp.com> wrote:
> What is the physical query plan after you set
> spark.sql.hive.convertMetastoreParquet to true?

Physical plan contains all the partition locations.

> Michael
>
> On Jan 17, 2017, at 6:51 PM, Raju Bairishetti <r...@apache.org> wrote:
>
> Thanks Michael for the response.
>
> On Wed, Jan 18, 2017 at 2:45 AM, Michael Allman <mich...@videoamp.com> wrote:
>
>> Hi Raju,
>>
>> I'm sorry this isn't working for you. I helped author this functionality
>> and will try my best to help.
>>
>> First, I'm curious why you set spark.sql.hive.convertMetastoreParquet
>> to false?
>
> I had set it as suggested in SPARK-6910 and the corresponding pull
> requests. It did not work for me without setting
> spark.sql.hive.convertMetastoreParquet.

> Can you link specifically to the jira issue or spark pr you referred to?

>> The first thing I would try is setting
>> spark.sql.hive.convertMetastoreParquet to true. Setting that to false
>> might also explain why you're getting parquet decode errors. If you're
>> writing your table data with Spark's parquet file writer and reading
>> with Hive's parquet file reader, there may be an incompatibility
>> accounting for the decode errors you're seeing.
>
> The jira is https://issues.apache.org/jira/browse/SPARK-6910. My main
> motivation is to avoid fetching all the partitions. We reverted
> spark.sql.hive.convertMetastoreParquet to true due to the decoding
> errors. After reverting it, Spark is fetching all partitions from the
> table.
>
>> Can you reply with your table's Hive metastore schema, including
>> partition schema?
>
> col1 string
> col2 string
> year int
> month int
> day int
> hour int
>
> # Partition Information
> # col_name    data_type    comment
> year int
> month int
> day int
> hour int
> venture string
>
>> Where are the table's files located?
>
> In Hadoop, under a user directory.
>
>> If you do a "show partitions <dbname>.<tablename>" in the spark-sql
>> shell, does it show the partitions you expect to see?
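For reference, the "show partitions" check suggested here can be run from a Spark 2.x spark-shell, where `spark` (a Hive-enabled SparkSession) is predefined; the database and table names below are hypothetical stand-ins:

```scala
// List the partitions the metastore knows about for the table
// (should match the year/month/day/hour/venture partition schema above):
spark.sql("SHOW PARTITIONS dbname.tablename").show(100, false)

// The table schema and partition columns can be inspected the same way:
spark.sql("DESCRIBE FORMATTED dbname.tablename").show(100, false)
```

Both statements only query the metastore, so they are cheap to run even on a heavily partitioned table.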
>> If not, run "msck repair table <dbname>.<tablename>".
>
> Yes. It is listing the partitions.
>
>> Cheers,
>>
>> Michael
>>
>> On Jan 17, 2017, at 12:02 AM, Raju Bairishetti <r...@apache.org> wrote:
>>
>> Had a high-level look into the code. Seems the getHiveQlPartitions
>> method from HiveMetastoreCatalog is getting called irrespective of the
>> metastorePartitionPruning conf value.
>>
>> It should not fetch all partitions if we set metastorePartitionPruning
>> to true (the default value is false):
>>
>> def getHiveQlPartitions(predicates: Seq[Expression] = Nil): Seq[Partition] = {
>>   val rawPartitions = if (sqlContext.conf.metastorePartitionPruning) {
>>     table.getPartitions(predicates)
>>   } else {
>>     allPartitions
>>   }
>>
>>   ...
>>
>> def getPartitions(predicates: Seq[Expression]): Seq[HivePartition] =
>>   client.getPartitionsByFilter(this, predicates)
>>
>> lazy val allPartitions = table.getAllPartitions
>>
>> But somehow getAllPartitions is getting called even after setting
>> metastorePartitionPruning to true.
>>
>> Am I missing something, or am I looking in the wrong place?
>>
>> On Tue, Jan 17, 2017 at 4:01 PM, Raju Bairishetti <r...@apache.org> wrote:
>>
>>> Hello,
>>>
>>> Spark SQL is generating the query plan with all partition information
>>> even when we apply filters on partitions in the query. Due to this,
>>> the Spark driver/Hive metastore is hitting OOM, as each table has lots
>>> of partitions.
>>>
>>> We can confirm from the Hive audit logs that it tries to *fetch all
>>> partitions* from the Hive metastore.
>>>
>>> 2016-12-28 07:18:33,749 INFO [pool-4-thread-184]: HiveMetaStore.audit
>>> (HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub ip=/x.x.x.x
>>> cmd=get_partitions : db=xxxx tbl=xxxxx
>>>
>>> Configured the following parameters in the spark conf to fix the above
>>> issue (source: spark jira & github pull requests):
>>>
>>> spark.sql.hive.convertMetastoreParquet false
>>> spark.sql.hive.metastorePartitionPruning true
>>>
>>> plan: rdf.explain
>>> == Physical Plan ==
>>> HiveTableScan [rejection_reason#626], MetastoreRelation dbname,
>>> tablename, None, [(year#314 = 2016),(month#315 = 12),(day#316 =
>>> 28),(hour#317 = 2),(venture#318 = DEFAULT)]
>>>
>>> With these settings, the *get_partitions_by_filter* method is called
>>> and fetches only the required partitions.
>>>
>>> But we are seeing parquetDecode errors in our applications frequently
>>> after this. Looks like these decoding errors were because of changing
>>> the serde from spark-builtin to the hive serde.
>>>
>>> I feel that *fixing query plan generation in spark-sql* is the right
>>> approach instead of forcing users to use the hive serde.
>>>
>>> Is there any workaround/way to fix this issue? I would like to hear
>>> more thoughts on this :)
>>>
>>> On Mon, Jan 16, 2017 at 12:53 PM, Raju Bairishetti <r...@apache.org> wrote:
>>>
>>>> Waiting for suggestions/help on this...
>>>
>>> ------
>>> Thanks,
>>> Raju Bairishetti,
>>> www.lazada.com

--
------
Thanks,
Raju Bairishetti,
www.lazada.com
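A short sketch tying together the settings discussed in this thread, assuming a Spark 2.1+ spark-shell (where `spark` is predefined, and where metastore partition pruning also works for the converted Parquet path) and a hypothetical partitioned table; spark.sql.parquet.writeLegacyFormat is offered only as a possible mitigation for the Spark-writer/Hive-reader Parquet mismatch raised above, not something confirmed in the thread:

```scala
// Keep Spark's built-in Parquet reader and push partition predicates
// down to the Hive metastore:
spark.sql("SET spark.sql.hive.convertMetastoreParquet=true")
spark.sql("SET spark.sql.hive.metastorePartitionPruning=true")

// With pruning in effect, the Hive audit log should record
// get_partitions_by_filter rather than get_partitions for this query:
spark.sql(
  "SELECT count(*) FROM dbname.tablename " +
  "WHERE year = 2016 AND month = 12 AND day = 28 AND hour = 2").explain(true)

// If the data must also stay readable through the Hive serde, writing
// Parquet in the legacy (Hive-compatible) format may avoid the decode
// errors described above:
spark.sql("SET spark.sql.parquet.writeLegacyFormat=true")
```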