Yes, I assumed that you had ACID enabled. I don't think what I said would be true without it.

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Wednesday, June 8, 2016 at 4:06 PM
To: user <user@hive.apache.org>
Subject: Re: Creating a Hive table through Spark and potential locking issue (a bug)

BTW, DbTxnManager is set as well:


  <property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
    <description>
      Set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager as part of turning on Hive
      transactions, which also requires appropriate settings for hive.compactor.initiator.on,
      hive.compactor.worker.threads, hive.support.concurrency (true), hive.enforce.bucketing
      (true), and hive.exec.dynamic.partition.mode (nonstrict).
      The default DummyTxnManager replicates pre-Hive-0.13 behavior and provides
      no transactions.
    </description>
  </property>
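For completeness, the companion settings named in that description can also be set for a session from the Hive CLI; a minimal sketch (the worker-thread value is illustrative, and in practice the compactor settings belong in hive-site.xml on the metastore side):

hive> set hive.support.concurrency=true;
hive> set hive.enforce.bucketing=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
-- the two compactor settings below normally live in hive-site.xml, not a session
hive> set hive.compactor.initiator.on=true;
hive> set hive.compactor.worker.threads=1;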



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 8 June 2016 at 23:52, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
OK, this seems to work:

  1.  Create the target table first
  2.  Populate it afterwards

I first created the empty target table with

hive> create table test.dummy as select * from oraclehadoop.dummy where 1 = 2;

Then I ran the INSERT/SELECT and, while that DML was still going on, tried to drop the target table from another session.
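A minimal sketch of the two sessions (the exact INSERT statement is not shown above, so it is assumed here from context):

-- Session 1: long-running DML populating the pre-created target table
hive> insert into table test.dummy select * from oraclehadoop.dummy;

-- Session 2, issued while session 1 is still running: should block
-- until the SHARED_READ locks shown below are released
hive> drop table test.dummy;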

Now process 6856 (the DROP TABLE) is waiting for the locks to be released, which is correct:


Lock ID  Database      Table  Partition  State     Type         Transaction ID  Last Heartbeat  Acquired At    User    Hostname
6855     test          dummy  NULL       ACQUIRED  SHARED_READ  NULL            1465425703092   1465425703054  hduser  rhes564
6855     oraclehadoop  dummy  NULL       ACQUIRED  SHARED_READ  NULL            1465425703092   1465425703056  hduser  rhes564
6856     test          dummy  NULL       WAITING   EXCLUSIVE    NULL            1465425820073   NULL           hduser  rhes564

It sounds like Hive has an issue when DDL and DML locks must be applied within a single statement, i.e. CTAS: create table a as select * from b.

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 8 June 2016 at 23:35, Eugene Koifman <ekoif...@hortonworks.com> wrote:
If you split "create table test.dummy as select * from oraclehadoop.dummy;" into a CREATE TABLE statement followed by an INSERT INTO test.dummy SELECT…, you should see the behavior you expect with Hive: the DROP statement will block while the insert is running.
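Concretely, a sketch of the split (CREATE TABLE ... LIKE is one way to write the DDL half; the follow-up above uses a WHERE 1 = 2 CTAS instead):

-- DDL only: create an empty table with the same schema
hive> create table test.dummy like oraclehadoop.dummy;
-- DML only: populate it; a concurrent DROP should now block on this
hive> insert into table test.dummy select * from oraclehadoop.dummy;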

Eugene

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Wednesday, June 8, 2016 at 3:12 PM
To: Michael Segel <michael_se...@hotmail.com>
Cc: David Newberger <david.newber...@wandcorp.com>, "user@hive.apache.org" <user@hive.apache.org>, "user @spark" <u...@spark.apache.org>
Subject: Re: Creating a Hive table through Spark and potential locking issue (a bug)

The Hive version is 2.

We can discuss all sorts of scenarios; however, Hive is pretty good at applying locks at both the table and partition level. The whole point of the metastore is to enforce these rules.

[Inline image 1: lock listing for a partitioned insert]

The example above, inserting from a source table into a target table partitioned by (year, month), shows that the locks are applied correctly.

This is Hive running on the Spark engine. The crucial point is that Hive goes through its metastore and updates the HIVE_LOCKS table there. Again, one can see this from the data held in that table:

[Inline image 2: rows from the HIVE_LOCKS metastore table]
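For reference, a sketch of the kind of query behind that screenshot, run directly against the metastore RDBMS (e.g. MySQL), not the Hive CLI; the column names below follow the standard metastore schema, so verify them against your metastore version:

SELECT hl_lock_ext_id, hl_db, hl_table, hl_partition,
       hl_lock_state, hl_lock_type, hl_txnid,
       hl_last_heartbeat, hl_acquired_at, hl_user, hl_host
FROM   HIVE_LOCKS
ORDER BY hl_lock_ext_id;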

So I think there is a genuine issue here

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 8 June 2016 at 22:36, Michael Segel <michael_se...@hotmail.com> wrote:
Hi,

Let's take a step back.

Which version of Hive?

Hive recently added transaction support, so you have to know your isolation level.

Also, are you running Spark as your execution engine, or are you talking about a Spark app running with a Hive context, where you then drop the table from within a Hive shell while the Spark app is still running?

You also have two different things happening: you're mixing DDL with a query. How does Hive know you have another app reading from the table? I mean, what happens when you run a select * from foo; in one shell and try dropping foo in another? If you want to simulate an M/R job, add something like an order by 1 clause. (See the sketch below.)
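Something like this, where foo is a hypothetical table:

-- Shell 1: the ORDER BY forces a full job rather than a simple fetch
hive> select * from foo order by 1;

-- Shell 2, issued while shell 1 is still running: does it wait?
hive> drop table foo;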

HTH

-Mike



On Jun 8, 2016, at 1:44 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

The idea of accessing the Hive metadata is to be aware of concurrency.

In general, if I do the following in Hive:


hive> create table test.dummy as select * from oraclehadoop.dummy;



We can see that Hive applies the locks:

[Inline image: lock listing for the CTAS]
However, there seems to be an issue: I do not see any exclusive lock on the target table (i.e. test.dummy). The SHARED_READ lock type on the source table oraclehadoop.dummy looks OK.

One can see the locks in the Hive metastore database:

[Inline image: lock rows in the metastore]
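The same information is visible from the Hive CLI with SHOW LOCKS (standard HiveQL; the EXTENDED form adds per-lock detail):

hive> show locks;
hive> use test;
hive> show locks dummy extended;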




So there are a few issues here:

  1.  With Hive --> the source table is locked as SHARED_READ
  2.  With Hive --> no lock on the target table
  3.  With Spark --> no locks at all

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 8 June 2016 at 20:22, David Newberger <david.newber...@wandcorp.com> wrote:
Could you be looking at two jobs trying to use the same file, with one getting to it before the other and finally removing it?

David Newberger

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Wednesday, June 8, 2016 1:33 PM
To: user; user @spark
Subject: Creating a Hive table through Spark and potential locking issue (a bug)


Hi,

I noticed an issue with Spark creating and populating a Hive table.

The process as I see it is as follows:


  1.  Spark creates the Hive table; in this case an ORC table in a Hive database
  2.  Spark uses a JDBC connection to get the data out of Oracle
  3.  I create a temp table in Spark via registerTempTable
  4.  Spark populates that table. That table is actually created under

       hdfs dfs -ls /tmp/hive/hduser
       drwx------   - hduser supergroup
       /tmp/hive/hduser/b1ea6829-790f-4b37-a0ff-3ed218388059



  5.  However, the original table itself does not have any lock on it!
  6.  I log into Hive and drop that table:

hive> drop table dummy;
OK

  7.  That table is dropped OK
  8.  Spark crashes with the message:
Started at
[08/06/2016 18:37:53.53]
16/06/08 19:13:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/hive/warehouse/oraclehadoop.db/dummy/.hive-staging_hive_2016-06-08_18-38-08_804_3299712811201460314-1/-ext-10000/_temporary/0/_temporary/attempt_201606081838_0001_m_000000_0/part-00000 (inode 831621): File does not exist. Holder DFSClient_NONMAPREDUCE_-1836386597_1 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3313)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3169)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
        at org.apache.hadoop.ipc.Client.call(Client.java:1468)
        at org.apache.hadoop.ipc.Client.call(Client.java:1399)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy22.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy23.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
16/06/08 19:13:46 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job

Suggested solution: in a concurrent environment, Spark should apply locks in order to prevent such operations. Locks are kept in the Hive metastore table HIVE_LOCKS.

HTH
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com