Re: Review Request 67485: HIVE-19783 Retrieve only locations in HiveMetaStore.dropPartitionsAndGetLocations

Peter Vary via Review Board Thu, 14 Jun 2018 03:55:00 -0700


> On jún. 13, 2018, 4:50 du, Vihang Karajgaonkar wrote:
> > standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
> > Line 2545 (original), 2538 (patched)
> > <https://reviews.apache.org/r/67485/diff/2/?file=2036273#file2036273line2545>
> >
> >     My concern here is that we are removing the batch processing from this 
> > method. While the memory footprint of this method has reduced since we are 
> > not retrieving the fully loaded partition objects, I am worried that it may 
> > still cause OOMs for very large tables. Do you have any testing results 
> > which shows that this implementation is not any worse than what we already 
> > have in terms of the memory footprint?


I was able to run tests with a HMS using 4G memory dropping 1 million 
partitions without problems (It was harder to create the test tables, than 
dropping them :) )

I think the typical size for partitionName is for a 5 level parititoned table 
is ~150 bytes, and the location is ~300 bytes (partition name, plus table 
location), which is around 500 bytes for partitions. The theortical maximum is 
partitionName 767 bytes, and location 4000 bytes.
Currently there are customers who are not able to drop tables with 100k 
partitions. For this number, the typical location map is 50M. The theortical 
maximum is for the map ~500M.

I think for a metastore where a table contains 100k partitions 50M of memory 
allocation should not cause a problem. These customers often have 64G of memory 
set for HMS.

Also we rutinely query every partition name for a table (see: 
PartitionIterator). If we have a 5 level partitioned table, then the memory 
pressure is in the range of this method, and we do not allow any other query 
run against this table.

I improved the change with your idea, so from now on getPartitionLocation will 
not return the locations which are parent for the base directory. So for 
typical managed tables it will return null for every partition thus the load 
will be raffly the same than the PartitionIterator.

If we decice we should query the partition locations in batches then we could 
do it in a follow-up jira:
- new configuration parameter - Like: 
metastore.batch.retrieve.table.partition.location.max = 10000
- modify getPartitionLocations to have input like partitionNames. We will have 
move to use getPartQueryWithParams which we have to check how it handles big 
numbers of partitionNames. 
- get the partition name list when dropping partitions, and getting the 
locations for batches.

What do you think? Is the possibility of memory problems in this case worth the 
extra complexity and risk?


> On jún. 13, 2018, 4:50 du, Vihang Karajgaonkar wrote:
> > standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
> > Lines 2575 (patched)
> > <https://reviews.apache.org/r/67485/diff/2/?file=2036273#file2036273line2593>
> >
> >     Can we avoid creating a new List here by changing the method signature 
> > to get a Collection instead of List<string>?

I tried to do this, but in the end I had to change the interface of the 
Batchable.runBatched. I shied away from this, since the list interface might be 
important there.


> On jún. 13, 2018, 4:50 du, Vihang Karajgaonkar wrote:
> > standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
> > Lines 2784-2785 (patched)
> > <https://reviews.apache.org/r/67485/diff/2/?file=2036274#file2036274line2784>
> >
> >     may be a future task for improvement. We should think of ways to reduce 
> > the duplicate strings here. Most of the partitions locations will have the 
> > same prefix in the their path location. Same with the partition keys part 
> > from the partname. Right now, may be if we just limit the number of rows in 
> > smaller batches somehow would do the trick.

This is a good idea, and not that big change. Also this can greatly reduce the 
memory pressure. Added this.


- Peter


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67485/#review204700
-----------------------------------------------------------


On jún. 7, 2018, 10:31 de, Peter Vary wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67485/
> -----------------------------------------------------------
> 
> (Updated jún. 7, 2018, 10:31 de)
> 
> 
> Review request for hive, Alexander Kolbasov and Vihang Karajgaonkar.
> 
> 
> Bugs: HIVE-19783
>     https://issues.apache.org/jira/browse/HIVE-19783
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Added a new getPartitionLocations method to the RawStore interface.
> 
> Implemented getPartitionLocations in ObjectStore using JDQL.
> Question: In CachedObjectStore: Shall I call rawStore.getPartitionLocations 
> or reimplement it using getPartitions?
> 
> Modified dropPartitionsAndGetLocations:
> - Instead of querying every partition data. Query only the locations using 
> the new interface method
> - Removed partKeys parameter which become unneccessary
> 
> 
> Diffs
> -----
> 
>   
> itests/hcatalog-unit/src/test/java/org/apache/hive/hcatalog/listener/DummyRawStoreFailEvent.java
>  ff97522 
>   
> standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
>  b9f5fb8 
>   
> standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
>  b3a8dd0 
>   
> standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/RawStore.java
>  f350aa9 
>   
> standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/cache/CachedStore.java
>  d9356b8 
>   
> standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/DummyRawStoreControlledCommit.java
>  8c3ada3 
>   
> standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java
>  f98e8de 
> 
> 
> Diff: https://reviews.apache.org/r/67485/diff/2/
> 
> 
> Testing
> -------
> 
> Run the TestTablesCreateDropAlterTruncate test (partitioned table creation 
> and drop)
> 
> 
> Thanks,
> 
> Peter Vary
> 
>

Re: Review Request 67485: HIVE-19783 Retrieve only locations in HiveMetaStore.dropPartitionsAndGetLocations

Reply via email to