[
https://issues.apache.org/jira/browse/HIVE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170041#comment-14170041
]
Sushanth Sowmyan commented on HIVE-7368:
----------------------------------------
Hi Selina,
Yes, I would agree that the connection pool (or the JDBC driver, since I have
now seen this happen a couple of times with DBCP as well) is probably raising
some sort of internal error that DN incorrectly reads as normal operation,
which results in an NSOE from the hive ObjectStore. I definitely agree that
this is the underlying error that we need to reproduce and track down to fix.
In the case of a persistent remote metastore, I would agree that increasing the
size of the connection pools makes sense, and should be the way to go. I
generally do advise a larger pool, and always going through the metastore.
But in the case of parallel hive fat clients, the embedded metastore is
effectively single-threaded w.r.t. connections to the database, so I'm afraid
I don't yet understand how having a larger pool would help in this case. Could
you please expand on this bit?
(And yes, "datanucleus.connectionPool.testSQL=SELECT 1" is so that the overhead
of DN testing connectivity to the db is minimized - without that, DN creates a
bunch of deleteme* tables and drops them to test connectivity.)
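
As a rough sketch of how that setting might be applied per fat client when
trying to reproduce this, the snippet below builds a hive command line that
passes the testSQL property via --hiveconf and runs the drop statement from
the issue description. The helper names (hive_command, shell_line) are
hypothetical, and it assumes the embedded metastore picks up datanucleus.*
properties supplied this way; it only constructs the command and does not run
hive.

```python
import shlex

# Property taken verbatim from this thread; passing it through --hiveconf
# (rather than hive-site.xml) is an assumption made for this sketch.
HIVECONF = {
    "datanucleus.connectionPool.testSQL": "SELECT 1",
}


def hive_command(i, conf=HIVECONF):
    """Return the argv list for one hive client dropping table tmp_blah_<i>."""
    args = ["hive"]
    for key, value in conf.items():
        args += ["--hiveconf", f"{key}={value}"]
    # Statement from the issue description, with ${i} filled in.
    args += ["-e", f"drop table if exists tmp_blah_{i};"]
    return args


def shell_line(i):
    """Render the same command as a single, safely quoted shell line."""
    return " ".join(shlex.quote(arg) for arg in hive_command(i))
```

Calling shell_line(i) for each client index gives lines that could be pasted
into a shell loop; the argv list from hive_command(i) could instead be handed
to subprocess.Popen to start many clients in parallel.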
> datanucleus sometimes returns an empty result instead of an error or data
> -------------------------------------------------------------------------
>
> Key: HIVE-7368
> URL: https://issues.apache.org/jira/browse/HIVE-7368
> Project: Hive
> Issue Type: Bug
> Components: Metastore
> Affects Versions: 0.12.0
> Reporter: Sushanth Sowmyan
>
> I investigated a scenario wherein a user needed to use a large number of
> concurrent hive clients doing simple DDL tasks, while not using a standalone
> metastore server. Say, for example, each of them doing "drop table if exists
> tmp_blah_${i};"
> This would consistently fail stating that it could not create a db, which is
> a funny error to have when trying to drop a table "if exists". On digging in,
> it turned out that the error was a mistaken report, coming instead from an
> attempt by the embedded metastore to create a "default" db when it
> did not exist. The funny thing being that the default db did exist, and the
> getDatabase call would return empty, rather than returning an error or
> returning a result. We could disable hive.metastore.checkForDefaultDb and the
> number of these reports would drastically fall, but that only moved the
> problem, and we'd get phantom reports from time to time of various other
> databases that existed that were being reported as non-existent.
> On digging further, parallelism seemed to be an important factor in whether
> or not hive was able to perform getDatabases without error. With about 20
> simultaneous processes, there seemed to be no errors whatsoever. At about 40
> simultaneous processes, at least 1 would consistently fail. At about 200,
> about 15-20 would consistently fail, in addition to taking a long time to run.
> I wrote a sample JDBC ping (actually a get_database mimic) utility to see
> whether the issue was with connecting from that machine to the database
> server, and this had no errors whatsoever up to 400 simultaneous processes.
> The mysql server in question was configured to serve up to 650 connections,
> and it seemed to be serving responses quickly and did not seem overloaded. We
> also disabled connection pooling in case that was exacerbating a connection
> availability issue with that many concurrent processes, each with an embedded
> metastore. That, especially in conjunction with disabling schema checking
> and specifying "datanucleus.connectionPool.testSQL=SELECT 1", did a fair
> amount for performance in this scenario, but the errors (or rather, the
> null-result-successes when there shouldn't have been one) continued.
> On checking through hive again, if we modified hive to have datanucleus
> simply return a connection, with which we did a direct sql get database,
> there would not be any error, but if we tried to use jdo on datanucleus to
> construct a db object, we would get an empty result, so the issue seems to
> crop up in the jdo mapping.
> One of the biggest issues with this investigation, for me, was the difficulty
> of reproducibility. When trying to reproduce in a lab, we were unable to
> create a similar enough environment that caused the issue. Even in the
> client's environment, moving from RHEL5 to RHEL6 made the issue go away.
> Thus, we still have work to do on determining the underlying issue. I'm
> logging this issue to collect information on similar issues we discover so we
> can work towards nailing down the problem and then fixing it (in DN if need be).
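
The parallelism experiment and the "get_database mimic" ping described in the
issue above could be sketched roughly as follows. This is not the original
utility: it uses Python threads instead of separate JDBC processes, and a
throwaway SQLite file stands in for the MySQL metastore database. The DBS
table layout and the function names are assumptions for illustration. The
point it shows is the three-way classification of each lookup as data, empty,
or error, which is the distinction the issue title is about.

```python
import os
import sqlite3
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def make_stand_in_db(path):
    """Create a tiny stand-in for the metastore DBS table (assumed schema)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT UNIQUE)")
    conn.execute("INSERT INTO DBS (NAME) VALUES ('default')")
    conn.commit()
    conn.close()


def probe(path, db_name):
    """Look up one database by name and report 'data', 'empty', or 'error'."""
    try:
        conn = sqlite3.connect(path, timeout=30)
        try:
            row = conn.execute(
                "SELECT DB_ID, NAME FROM DBS WHERE NAME = ?", (db_name,)
            ).fetchone()
        finally:
            conn.close()
    except sqlite3.Error:
        return "error"
    # An existing database should never come back as 'empty'.
    return "data" if row is not None else "empty"


def run_probes(path, db_name, workers):
    """Run `workers` simultaneous probes and count the outcomes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(lambda _: probe(path, db_name), range(workers)))
    return Counter(outcomes)


if __name__ == "__main__":
    db_path = os.path.join(tempfile.mkdtemp(), "metastore_stand_in.db")
    make_stand_in_db(db_path)
    for n in (20, 40, 200):
        print(n, dict(run_probes(db_path, "default", n)))
```

Against a healthy backend, every probe for "default" should land in the data
bucket at any level of parallelism; any "empty" count for a database known to
exist would correspond to the symptom reported here.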