[
https://issues.apache.org/jira/browse/HIVE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sushanth Sowmyan updated HIVE-7368:
-----------------------------------
Description:
I investigated a scenario wherein a user needed to run a large number of concurrent Hive clients performing simple DDL tasks, without using a standalone metastore server. Say, for example, each of them running "drop table if exists tmp_blah_${i};".
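For concreteness, this kind of load can be mimicked with a small driver along the following lines (a minimal sketch only: it assumes the hive CLI is on the PATH, and the client count and table names are illustrative):

{code:java}
import java.util.ArrayList;
import java.util.List;

public class ConcurrentDdlRepro {
    public static void main(String[] args) throws Exception {
        int clients = args.length > 0 ? Integer.parseInt(args[0]) : 40;
        List<Process> procs = new ArrayList<Process>();
        for (int i = 0; i < clients; i++) {
            // Each hive CLI process brings up its own embedded metastore,
            // so every client talks to the backing database directly.
            procs.add(new ProcessBuilder(
                "hive", "-e", "drop table if exists tmp_blah_" + i + ";")
                .inheritIO().start());
        }
        int failures = 0;
        for (Process p : procs) {
            if (p.waitFor() != 0) {
                failures++;
            }
        }
        System.out.println(failures + " of " + clients + " clients failed");
    }
}
{code}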
This would consistently fail, stating that it could not create a db, which is a funny error to get when trying to drop a table "if exists". On digging in, it turned out that the error was a mistaken report, coming instead from an attempt by the embedded metastore to create the "default" db when it did not exist. The funny thing is that the default db did exist, and the getDatabase call would return empty, rather than returning either an error or a result.
We could disable hive.metastore.checkForDefaultDb, and the number of these reports would drop drastically, but that only moved the problem: we would still get occasional phantom reports of various other databases that existed being reported as non-existent.
On digging further, parallelism seemed to be an important factor in whether or not Hive was able to perform getDatabase calls without error. With about 20 simultaneous processes, there were no errors whatsoever. At about 40 simultaneous processes, at least one would consistently fail. At about 200, about 15-20 would consistently fail, in addition to taking a long time to run.
I wrote a sample JDBC ping utility (actually a get_database mimic) to see whether the issue was with connecting from that machine to the database server; this had no errors whatsoever up to 400 simultaneous processes. The MySQL server in question was configured to serve up to 650 connections, and it seemed to be serving responses quickly and did not appear overloaded. We also disabled connection pooling, in case it was exacerbating a connection-availability issue with that many concurrent processes, each with an embedded metastore.
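The ping utility is not attached here, but the idea looks roughly like the following sketch (threads stand in for the separate processes actually used; the URL, credentials, and query against the metastore schema's DBS table are illustrative):

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcPing {
    // Placeholder URL/credentials for the metastore's MySQL backend.
    private static final String URL = "jdbc:mysql://dbhost:3306/hivemetastore";

    public static void main(String[] args) throws Exception {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 400;
        Thread[] workers = new Thread[n];
        for (int i = 0; i < n; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    // Mimic get_database("default") with plain JDBC against
                    // the DBS table of the metastore schema.
                    try (Connection conn = DriverManager.getConnection(URL, "hive", "hivepw");
                         PreparedStatement ps = conn.prepareStatement(
                             "SELECT NAME, DB_LOCATION_URI FROM DBS WHERE NAME = ?")) {
                        ps.setString(1, "default");
                        try (ResultSet rs = ps.executeQuery()) {
                            if (!rs.next()) {
                                System.err.println("empty result for db 'default'");
                            }
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
    }
}
{code}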
That, especially in conjunction with disabling schema checking and specifying "datanucleus.connectionPool.testSQL=SELECT 1", did a fair amount for performance in this scenario, but the errors (or rather, the null-result successes where there shouldn't have been any) continued.
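For reference, the combination of settings described above corresponds roughly to the following JDO/DataNucleus properties. In Hive these would normally be set in hive-site.xml; the Java form below is a sketch for illustration, with placeholder connection details, and the property names match the DataNucleus versions of this era:

{code:java}
import java.util.Properties;
import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManagerFactory;

public class MetastorePmfConfig {
    public static PersistenceManagerFactory build() {
        Properties props = new Properties();
        props.setProperty("javax.jdo.PersistenceManagerFactoryClass",
            "org.datanucleus.api.jdo.JDOPersistenceManagerFactory");
        // Placeholder connection details for the MySQL-backed metastore.
        props.setProperty("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver");
        props.setProperty("javax.jdo.option.ConnectionURL",
            "jdbc:mysql://dbhost:3306/hivemetastore");
        props.setProperty("javax.jdo.option.ConnectionUserName", "hive");
        props.setProperty("javax.jdo.option.ConnectionPassword", "hivepw");
        // Disable connection pooling entirely, as described above.
        props.setProperty("datanucleus.connectionPoolingType", "None");
        // Cheap validation query for whichever pool is in use.
        props.setProperty("datanucleus.connectionPool.testSQL", "SELECT 1");
        // Disable schema checking on startup.
        props.setProperty("datanucleus.validateTables", "false");
        props.setProperty("datanucleus.validateColumns", "false");
        props.setProperty("datanucleus.validateConstraints", "false");
        return JDOHelper.getPersistenceManagerFactory(props);
    }
}
{code}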
On checking through Hive again: if we modified Hive to have DataNucleus simply return a connection, with which we did a direct SQL get-database, there would be no error; but if we tried to use JDO on DataNucleus to construct a db object, we would get an empty result. The issue therefore seems to crop up in the JDO mapping.
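To make the comparison concrete, the two lookup paths look roughly like this (a sketch against DataNucleus's JDO API and Hive's MDatabase model class, not the exact code used in the experiment):

{code:java}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.jdo.PersistenceManager;
import javax.jdo.Query;
import javax.jdo.datastore.JDOConnection;
import org.apache.hadoop.hive.metastore.model.MDatabase;

public class GetDatabaseTwoWays {

    // Path 1: direct SQL over the connection DataNucleus hands back.
    // This path never returned an empty result in our tests.
    static boolean existsViaSql(PersistenceManager pm, String name) throws Exception {
        JDOConnection jdoConn = pm.getDataStoreConnection();
        try {
            Connection conn = (Connection) jdoConn.getNativeConnection();
            try (PreparedStatement ps =
                     conn.prepareStatement("SELECT NAME FROM DBS WHERE NAME = ?")) {
                ps.setString(1, name);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next();
                }
            }
        } finally {
            jdoConn.close(); // hand the connection back to DataNucleus
        }
    }

    // Path 2: JDO query mapped to the MDatabase model object.
    // This path is the one that intermittently came back empty.
    static MDatabase getViaJdo(PersistenceManager pm, String name) {
        Query query = pm.newQuery(MDatabase.class, "name == dbName");
        query.declareParameters("java.lang.String dbName");
        query.setUnique(true);
        return (MDatabase) query.execute(name);
    }
}
{code}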
One of the biggest issues with this investigation, for me, was the difficulty of reproducing it. When trying to reproduce it in a lab, we were unable to create an environment similar enough to trigger the issue. Even in the client's environment, moving from RHEL5 to RHEL6 made the issue go away.
Thus, we still have work to do on determining the underlying issue. I'm logging this JIRA to collect information on similar issues we discover, so we can work towards nailing down the root cause and then fixing it (in DataNucleus if need be).
> datanucleus sometimes returns an empty result instead of an error or data
> -------------------------------------------------------------------------
>
> Key: HIVE-7368
> URL: https://issues.apache.org/jira/browse/HIVE-7368
> Project: Hive
> Issue Type: Bug
> Components: Metastore
> Affects Versions: 0.12.0
> Reporter: Sushanth Sowmyan
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)