> On jún. 13, 2018, 4:50 du, Vihang Karajgaonkar wrote: > > standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java > > Line 2545 (original), 2538 (patched) > > <https://reviews.apache.org/r/67485/diff/2/?file=2036273#file2036273line2545> > > > > My concern here is that we are removing the batch processing from this > > method. While the memory footprint of this method has reduced since we are > > not retrieving the fully loaded partition objects, I am worried that it may > > still cause OOMs for very large tables. Do you have any testing results > > which shows that this implementation is not any worse than what we already > > have in terms of the memory footprint?
I was able to run tests with a HMS using 4G memory dropping 1 million partitions without problems (It was harder to create the test tables, than dropping them :) ) I think the typical size for partitionName is for a 5 level parititoned table is ~150 bytes, and the location is ~300 bytes (partition name, plus table location), which is around 500 bytes for partitions. The theortical maximum is partitionName 767 bytes, and location 4000 bytes. Currently there are customers who are not able to drop tables with 100k partitions. For this number, the typical location map is 50M. The theortical maximum is for the map ~500M. I think for a metastore where a table contains 100k partitions 50M of memory allocation should not cause a problem. These customers often have 64G of memory set for HMS. Also we rutinely query every partition name for a table (see: PartitionIterator). If we have a 5 level partitioned table, then the memory pressure is in the range of this method, and we do not allow any other query run against this table. I improved the change with your idea, so from now on getPartitionLocation will not return the locations which are parent for the base directory. So for typical managed tables it will return null for every partition thus the load will be raffly the same than the PartitionIterator. If we decice we should query the partition locations in batches then we could do it in a follow-up jira: - new configuration parameter - Like: metastore.batch.retrieve.table.partition.location.max = 10000 - modify getPartitionLocations to have input like partitionNames. We will have move to use getPartQueryWithParams which we have to check how it handles big numbers of partitionNames. - get the partition name list when dropping partitions, and getting the locations for batches. What do you think? Is the possibility of memory problems in this case worth the extra complexity and risk? > On jún. 13, 2018, 4:50 du, Vihang Karajgaonkar wrote: > > standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java > > Lines 2575 (patched) > > <https://reviews.apache.org/r/67485/diff/2/?file=2036273#file2036273line2593> > > > > Can we avoid creating a new List here by changing the method signature > > to get a Collection instead of List<string>? I tried to do this, but in the end I had to change the interface of the Batchable.runBatched. I shied away from this, since the list interface might be important there. > On jún. 13, 2018, 4:50 du, Vihang Karajgaonkar wrote: > > standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java > > Lines 2784-2785 (patched) > > <https://reviews.apache.org/r/67485/diff/2/?file=2036274#file2036274line2784> > > > > may be a future task for improvement. We should think of ways to reduce > > the duplicate strings here. Most of the partitions locations will have the > > same prefix in the their path location. Same with the partition keys part > > from the partname. Right now, may be if we just limit the number of rows in > > smaller batches somehow would do the trick. This is a good idea, and not that big change. Also this can greatly reduce the memory pressure. Added this. - Peter ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67485/#review204700 ----------------------------------------------------------- On jún. 7, 2018, 10:31 de, Peter Vary wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/67485/ > ----------------------------------------------------------- > > (Updated jún. 7, 2018, 10:31 de) > > > Review request for hive, Alexander Kolbasov and Vihang Karajgaonkar. > > > Bugs: HIVE-19783 > https://issues.apache.org/jira/browse/HIVE-19783 > > > Repository: hive-git > > > Description > ------- > > Added a new getPartitionLocations method to the RawStore interface. > > Implemented getPartitionLocations in ObjectStore using JDQL. > Question: In CachedObjectStore: Shall I call rawStore.getPartitionLocations > or reimplement it using getPartitions? > > Modified dropPartitionsAndGetLocations: > - Instead of querying every partition data. Query only the locations using > the new interface method > - Removed partKeys parameter which become unneccessary > > > Diffs > ----- > > > itests/hcatalog-unit/src/test/java/org/apache/hive/hcatalog/listener/DummyRawStoreFailEvent.java > ff97522 > > standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java > b9f5fb8 > > standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java > b3a8dd0 > > standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/RawStore.java > f350aa9 > > standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/cache/CachedStore.java > d9356b8 > > standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/DummyRawStoreControlledCommit.java > 8c3ada3 > > standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java > f98e8de > > > Diff: https://reviews.apache.org/r/67485/diff/2/ > > > Testing > ------- > > Run the TestTablesCreateDropAlterTruncate test (partitioned table creation > and drop) > > > Thanks, > > Peter Vary > >