[
https://issues.apache.org/jira/browse/HIVE-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13728115#comment-13728115
]
Phabricator commented on HIVE-4051:
-----------------------------------
ashutoshc has requested changes to the revision "HIVE-4051 [jira] Hive's
metastore suffers from 1+N queries when querying partitions & is slow".
Initial review pass. Few comments.
INLINE COMMENTS
build.xml:703 This is a good change which I like, but lets do it in separate
jira : )
metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java:1654
ObjectStore is getting too big. Lets move this class in a separate file.
metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java:1681 I
havent verified in code, but as far as I remember nothing in code at the moment
uses CREATE_TIME, LAST_ACCESS_TIME, so we can drop these from select list.
metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java:1711 You
can initialize size of these treemaps with result.size()
metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java:1681 We
already know dbName and table name, why are you having them in select list?
metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java:1735 If
we do decide to keep create and last access time, can you check if they are
Integer, I though they are longs.
metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java:1776 Can
there ever be a case when coldId == null ? I am not able to imagine one, did
you see some failures ?
metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java:1826
Instead of *, its better to use ORDER here, because RDBMS which doesn't support
ORDER as column name would anyway be broken when schemas are created in it.
metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java:2268
This query is not encapsulated in transaction anymore?
metastore/src/java/org/apache/hadoop/hive/metastore/parser/ExpressionTree.java:379
We should be able to support for non-strings as well, right? Is it just that
you want to do that in later jira ?
REVISION DETAIL
https://reviews.facebook.net/D11805
BRANCH
HIVE-4051
ARCANIST PROJECT
hive
To: JIRA, ashutoshc, sershe
Cc: brock
> Hive's metastore suffers from 1+N queries when querying partitions & is slow
> ----------------------------------------------------------------------------
>
> Key: HIVE-4051
> URL: https://issues.apache.org/jira/browse/HIVE-4051
> Project: Hive
> Issue Type: Bug
> Components: Clients, Metastore
> Environment: RHEL 6.3 / EC2 C1.XL
> Reporter: Gopal V
> Assignee: Sergey Shelukhin
> Attachments: HIVE-4051.D11805.1.patch, HIVE-4051.D11805.2.patch,
> HIVE-4051.D11805.3.patch, HIVE-4051.D11805.4.patch, HIVE-4051.D11805.5.patch,
> HIVE-4051.D11805.6.patch
>
>
> Hive's query client takes a long time to initialize & start planning queries
> because of delays in creating all the MTable/MPartition objects.
> For a hive db with 1800 partitions, the metastore took 6-7 seconds to
> initialize - firing approximately 5900 queries to the mysql database.
> Several of those queries fetch exactly one row to create a single object on
> the client.
> The following 12 queries were repeated for each partition, generating a storm
> of SQL queries
> {code}
> 4 Query SELECT
> `A0`.`SD_ID`,`B0`.`INPUT_FORMAT`,`B0`.`IS_COMPRESSED`,`B0`.`IS_STOREDASSUBDIRECTORIES`,`B0`.`LOCATION`,`B0`.`NUM_BUCKETS`,`B0`.`OUTPUT_FORMAT`,`B0`.`SD_ID`
> FROM `PARTITIONS` `A0` LEFT OUTER JOIN `SDS` `B0` ON `A0`.`SD_ID` =
> `B0`.`SD_ID` WHERE `A0`.`PART_ID` = 3945
> 4 Query SELECT `A0`.`CD_ID`,`B0`.`CD_ID` FROM `SDS` `A0` LEFT OUTER JOIN
> `CDS` `B0` ON `A0`.`CD_ID` = `B0`.`CD_ID` WHERE `A0`.`SD_ID` =4871
> 4 Query SELECT COUNT(*) FROM `COLUMNS_V2` THIS WHERE THIS.`CD_ID`=1546
> AND THIS.`INTEGER_IDX`>=0
> 4 Query SELECT
> `A0`.`COMMENT`,`A0`.`COLUMN_NAME`,`A0`.`TYPE_NAME`,`A0`.`INTEGER_IDX` AS
> NUCORDER0 FROM `COLUMNS_V2` `A0` WHERE `A0`.`CD_ID` = 1546 AND
> `A0`.`INTEGER_IDX` >= 0 ORDER BY NUCORDER0
> 4 Query SELECT `A0`.`SERDE_ID`,`B0`.`NAME`,`B0`.`SLIB`,`B0`.`SERDE_ID`
> FROM `SDS` `A0` LEFT OUTER JOIN `SERDES` `B0` ON `A0`.`SERDE_ID` =
> `B0`.`SERDE_ID` WHERE `A0`.`SD_ID` =4871
> 4 Query SELECT COUNT(*) FROM `SORT_COLS` THIS WHERE THIS.`SD_ID`=4871 AND
> THIS.`INTEGER_IDX`>=0
> 4 Query SELECT `A0`.`COLUMN_NAME`,`A0`.`ORDER`,`A0`.`INTEGER_IDX` AS
> NUCORDER0 FROM `SORT_COLS` `A0` WHERE `A0`.`SD_ID` =4871 AND
> `A0`.`INTEGER_IDX` >= 0 ORDER BY NUCORDER0
> 4 Query SELECT COUNT(*) FROM `SKEWED_VALUES` THIS WHERE
> THIS.`SD_ID_OID`=4871 AND THIS.`INTEGER_IDX`>=0
> 4 Query SELECT 'org.apache.hadoop.hive.metastore.model.MStringList' AS
> NUCLEUS_TYPE,`A1`.`STRING_LIST_ID`,`A0`.`INTEGER_IDX` AS NUCORDER0 FROM
> `SKEWED_VALUES` `A0` INNER JOIN `SKEWED_STRING_LIST` `A1` ON
> `A0`.`STRING_LIST_ID_EID` = `A1`.`STRING_LIST_ID` WHERE `A0`.`SD_ID_OID`
> =4871 AND `A0`.`INTEGER_IDX` >= 0 ORDER BY NUCORDER0
> 4 Query SELECT COUNT(*) FROM `SKEWED_COL_VALUE_LOC_MAP` WHERE `SD_ID`
> =4871 AND `STRING_LIST_ID_KID` IS NOT NULL
> 4 Query SELECT 'org.apache.hadoop.hive.metastore.model.MStringList' AS
> NUCLEUS_TYPE,`A0`.`STRING_LIST_ID` FROM `SKEWED_STRING_LIST` `A0` INNER JOIN
> `SKEWED_COL_VALUE_LOC_MAP` `B0` ON `A0`.`STRING_LIST_ID` =
> `B0`.`STRING_LIST_ID_KID` WHERE `B0`.`SD_ID` =4871
> 4 Query SELECT `A0`.`STRING_LIST_ID_KID`,`A0`.`LOCATION` FROM
> `SKEWED_COL_VALUE_LOC_MAP` `A0` WHERE `A0`.`SD_ID` =4871 AND NOT
> (`A0`.`STRING_LIST_ID_KID` IS NULL)
> {code}
> This data is not detached or cached, so this operation is performed during
> every query plan for the partitions, even in the same hive client.
> The queries are automatically generated by JDO/DataNucleus which makes it
> nearly impossible to rewrite it into a single denormalized join operation &
> process it locally.
> Attempts to optimize this with JDO fetch-groups did not bear fruit in
> improving the query count.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira