On Wed, Jan 18, 2017 at 11:13 AM, Michael Allman <mich...@videoamp.com> wrote:
> What is the physical query plan after you set
> spark.sql.hive.convertMetastoreParquet to true?

Physical plan contains all the partition locations.

> Michael
>
> On Jan 17, 2017, at 6:51 PM, Raju Bairishetti <r...@apache.org> wrote:
>
> Thanks Michael for the response.
>
> On Wed, Jan 18, 2017 at 2:45 AM, Michael Allman <mich...@videoamp.com> wrote:
>
>> Hi Raju,
>>
>> I'm sorry this isn't working for you. I helped author this functionality
>> and will try my best to help.
>>
>> First, I'm curious why you set spark.sql.hive.convertMetastoreParquet
>> to false?
>
> I had set it as suggested in SPARK-6910 and the corresponding pull
> requests. It did not work for me without setting
> spark.sql.hive.convertMetastoreParquet.

> Can you link specifically to the jira issue or spark pr you referred to?

>> The first thing I would try is setting
>> spark.sql.hive.convertMetastoreParquet to true. Setting that to false
>> might also explain why you're getting parquet decode errors. If you're
>> writing your table data with Spark's parquet file writer and reading
>> with Hive's parquet file reader, there may be an incompatibility
>> accounting for the decode errors you're seeing.
>
> The jira is https://issues.apache.org/jira/browse/SPARK-6910. My main
> motivation is to avoid fetching all the partitions. We reverted
> spark.sql.hive.convertMetastoreParquet to true due to the decoding
> errors. After reverting it, Spark is fetching all partitions from the
> table.
>
>> Can you reply with your table's Hive metastore schema, including
>> partition schema?
>
> col1 string
> col2 string
> year int
> month int
> day int
> hour int
>
> # Partition Information
> # col_name    data_type    comment
> year int
> month int
> day int
> hour int
> venture string
>
>> Where are the table's files located?
>
> In Hadoop, under a user directory.
>
>> If you do a "show partitions <dbname>.<tablename>" in the spark-sql
>> shell, does it show the partitions you expect to see?
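For reference, the "show partitions" check suggested here can be run from a Spark 2.x spark-shell, where `spark` (a Hive-enabled SparkSession) is predefined; the database and table names below are hypothetical stand-ins:

```scala
// List the partitions the metastore knows about for the table
// (should match the year/month/day/hour/venture partition schema above):
spark.sql("SHOW PARTITIONS dbname.tablename").show(100, false)

// The table schema and partition columns can be inspected the same way:
spark.sql("DESCRIBE FORMATTED dbname.tablename").show(100, false)
```

Both statements only query the metastore, so they are cheap to run even on a heavily partitioned table.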
>> If not, run "msck repair table <dbname>.<tablename>".
>
> Yes. It is listing the partitions.
>
>> Cheers,
>>
>> Michael
>>
>> On Jan 17, 2017, at 12:02 AM, Raju Bairishetti <r...@apache.org> wrote:
>>
>> Had a high-level look into the code. Seems the getHiveQlPartitions
>> method from HiveMetastoreCatalog is getting called irrespective of the
>> metastorePartitionPruning conf value.
>>
>> It should not fetch all partitions if we set metastorePartitionPruning
>> to true (the default value is false):
>>
>> def getHiveQlPartitions(predicates: Seq[Expression] = Nil): Seq[Partition] = {
>>   val rawPartitions = if (sqlContext.conf.metastorePartitionPruning) {
>>     table.getPartitions(predicates)
>>   } else {
>>     allPartitions
>>   }
>>
>>   ...
>>
>> def getPartitions(predicates: Seq[Expression]): Seq[HivePartition] =
>>   client.getPartitionsByFilter(this, predicates)
>>
>> lazy val allPartitions = table.getAllPartitions
>>
>> But somehow getAllPartitions is getting called even after setting
>> metastorePartitionPruning to true.
>>
>> Am I missing something, or am I looking in the wrong place?
>>
>> On Tue, Jan 17, 2017 at 4:01 PM, Raju Bairishetti <r...@apache.org> wrote:
>>
>>> Hello,
>>>
>>> Spark SQL is generating the query plan with all partition information
>>> even when we apply filters on partitions in the query. Due to this,
>>> the Spark driver/Hive metastore is hitting OOM, as each table has lots
>>> of partitions.
>>>
>>> We can confirm from the Hive audit logs that it tries to *fetch all
>>> partitions* from the Hive metastore.
>>>
>>> 2016-12-28 07:18:33,749 INFO [pool-4-thread-184]: HiveMetaStore.audit
>>> (HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub ip=/x.x.x.x
>>> cmd=get_partitions : db=xxxx tbl=xxxxx
>>>
>>> Configured the following parameters in the spark conf to fix the above
>>> issue (source: spark jira & github pull requests):
>>>
>>> spark.sql.hive.convertMetastoreParquet false
>>> spark.sql.hive.metastorePartitionPruning true
>>>
>>> plan: rdf.explain
>>> == Physical Plan ==
>>> HiveTableScan [rejection_reason#626], MetastoreRelation dbname,
>>> tablename, None, [(year#314 = 2016),(month#315 = 12),(day#316 =
>>> 28),(hour#317 = 2),(venture#318 = DEFAULT)]
>>>
>>> With these settings, the *get_partitions_by_filter* method is called
>>> and fetches only the required partitions.
>>>
>>> But we are seeing parquetDecode errors in our applications frequently
>>> after this. Looks like these decoding errors were because of changing
>>> the serde from spark-builtin to the hive serde.
>>>
>>> I feel that *fixing query plan generation in spark-sql* is the right
>>> approach instead of forcing users to use the hive serde.
>>>
>>> Is there any workaround/way to fix this issue? I would like to hear
>>> more thoughts on this :)
>>>
>>> On Mon, Jan 16, 2017 at 12:53 PM, Raju Bairishetti <r...@apache.org> wrote:
>>>
>>>> Waiting for suggestions/help on this...
>>>
>>> ------
>>> Thanks,
>>> Raju Bairishetti,
>>> www.lazada.com

--
------
Thanks,
Raju Bairishetti,
www.lazada.com
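A short sketch tying together the settings discussed in this thread, assuming a Spark 2.1+ spark-shell (where `spark` is predefined, and where metastore partition pruning also works for the converted Parquet path) and a hypothetical partitioned table; spark.sql.parquet.writeLegacyFormat is offered only as a possible mitigation for the Spark-writer/Hive-reader Parquet mismatch raised above, not something confirmed in the thread:

```scala
// Keep Spark's built-in Parquet reader and push partition predicates
// down to the Hive metastore:
spark.sql("SET spark.sql.hive.convertMetastoreParquet=true")
spark.sql("SET spark.sql.hive.metastorePartitionPruning=true")

// With pruning in effect, the Hive audit log should record
// get_partitions_by_filter rather than get_partitions for this query:
spark.sql(
  "SELECT count(*) FROM dbname.tablename " +
  "WHERE year = 2016 AND month = 12 AND day = 28 AND hour = 2").explain(true)

// If the data must also stay readable through the Hive serde, writing
// Parquet in the legacy (Hive-compatible) format may avoid the decode
// errors described above:
spark.sql("SET spark.sql.parquet.writeLegacyFormat=true")
```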