[ https://issues.apache.org/jira/browse/HUDI-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-1289:
---------------------------------
    Labels: pull-request-available  (was: )

             Key: HUDI-1289
             URL: https://issues.apache.org/jira/browse/HUDI-1289
         Project: Apache Hudi
      Issue Type: Bug
         Summary: Using hbase index in spark hangs in Hudi 0.6.0
        Reporter: Ryan Pifer
        Priority: Major
          Labels: pull-request-available
         Fix For: 0.6.1

In Hudi 0.6.0 there was a change to shade the HBase dependencies into the hudi-spark-bundle jar. When the HBASE index is used with only the hudi-spark-bundle jar specified in the Spark session, there are several issues:

1. Dependencies are not resolved correctly: HBase's default status listener class is defined by the class name before relocation.

{code:java}
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener not org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$Listener
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2427)
    at org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:656)
    ... 39 more
Caused by: java.lang.RuntimeException: class org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener not org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$Listener
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2421)
    ... 40 more
{code}

https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClusterStatusListener.java#L72-L73

This can be fixed by overriding the status listener class in the HBase configuration used by Hudi:

{code:java}
hbaseConfig.set("hbase.status.listener.class", "org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener");
{code}

https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java#L134

2. After applying the override above, executors hang when trying to connect to HBase and fail after about 45 minutes.

{code:java}
Caused by: org.apache.hudi.org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Thu Sep 17 23:59:42 UTC 2020, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68536: row 'hudiindex,12345678,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ip-10-81-236-56.ec2.internal,16020,1600130997457, seqNum=0
    at org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
    at org.apache.hudi.org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:210)
    at org.apache.hudi.org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
    at org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
    at org.apache.hudi.org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:212)
    at org.apache.hudi.org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:186)
    at org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1275)
    at org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1181)
    at org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1165)
    at org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1122)
    at org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:957)
    at org.apache.hudi.org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:83)
    at org.apache.hudi.org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:75)
    at org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:134)
    ...
35 more
{code}

When investigating the executor logs I was able to find the following:

{code:java}
20/09/18 21:35:48 TRACE TransportClient: Sending RPC to ip-10-31-253-39.ec2.internal/10.31.253.39:46825
20/09/18 21:35:48 TRACE TransportClient: Sending request RPC 7802669247197305083 to ip-10-31-253-39.ec2.internal/10.31.253.39:46825 took 0 ms
20/09/18 21:35:48 TRACE MessageDecoder: Received message RpcResponse: RpcResponse{requestId=7802669247197305083, body=NettyManagedBuffer{buf=SimpleLeakAwareByteBuf(PooledUnsafeDirectByteBuf(ridx: 21, widx: 102, cap: 128))}}
20/09/18 21:35:53 TRACE ZooKeeperRegistry: Looking up meta region location in ZK, connection=org.apache.hudi.org.apache.hadoop.hbase.client.ZooKeeperRegistry@268d3bae
20/09/18 21:35:53 TRACE ZKUtil: hconnection-0x4f596c31-0x10000036821007a, quorum=ip-10-31-253-39.ec2.internal:2181, baseZNode=/hbase Retrieved 51 byte(s) of data from znode /hbase/meta-region-server; data=PBUF\x0A)\x0A\x1Dip-10-16-254...
20/09/18 21:35:53 TRACE ZooKeeperRegistry: Looked up meta region location, connection=org.apache.hudi.org.apache.hadoop.hbase.client.ZooKeeperRegistry@268d3bae; servers = ip-10-16-254-233.ec2.internal,16020,1600298383776
20/09/18 21:35:53 TRACE MetaCache: Merged cached locations: [region=hbase:meta,,1.1588230740, hostname=ip-10-16-254-233.ec2.internal,16020,1600298383776, seqNum=0]
20/09/18 21:35:53 DEBUG RpcClientImpl: Use SIMPLE authentication for service ClientService, sasl=false
20/09/18 21:35:53 DEBUG RpcClientImpl: Connecting to ip-10-16-254-233.ec2.internal/10.16.254.233:16020
20/09/18 21:35:53 TRACE RpcClientImpl: IPC Client (669392975) connection to ip-10-16-254-233.ec2.internal/10.16.254.233:16020 from hadoop: starting, connections 1
20/09/18 21:35:53 TRACE RpcClientImpl: IPC Client (669392975) connection to ip-10-16-254-233.ec2.internal/10.16.254.233:16020 from hadoop: marking at should close, reason: null
20/09/18 21:35:53 TRACE RpcClientImpl: IPC Client (669392975) connection to ip-10-16-254-233.ec2.internal/10.16.254.233:16020 from hadoop: closing ipc connection to ip-10-16-254-233.ec2.internal/10.16.254.233:16020
20/09/18 21:35:53 TRACE RpcClientImpl: IPC Client (669392975) connection to ip-10-16-254-233.ec2.internal/10.16.254.233:16020 from hadoop: ipc connection to ip-10-16-254-233.ec2.internal/10.16.254.233:16020 closed
20/09/18 21:35:53 TRACE RpcClientImpl: IPC Client (669392975) connection to ip-10-16-254-233.ec2.internal/10.16.254.233:16020 from hadoop: stopped, connections 0
20/09/18 21:35:53 INFO RpcRetryingCaller: MESSAGE: Call to ip-10-16-254-233.ec2.internal/10.16.254.233:16020 failed on local exception: org.apache.hudi.org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to ip-10-16-254-233.ec2.internal/10.16.254.233:16020 is closing. Call id=418, waitTime=2
20/09/18 21:35:53 INFO RpcRetryingCaller: STACKTRACE [Ljava.lang.StackTraceElement;@20efcd07
20/09/18 21:35:53 INFO RpcRetryingCaller: CAUSE org.apache.hudi.org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to ip-10-16-254-233.ec2.internal/10.16.254.233:16020 is closing. Call id=418, waitTime=2
    at org.apache.hudi.org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.cleanupCalls(RpcClientImpl.java:1089)
    at org.apache.hudi.org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.close(RpcClientImpl.java:865)
    at org.apache.hudi.org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.run(RpcClientImpl.java:582)
20/09/18 21:35:53 INFO RpcRetryingCaller: Call exception, tries=10, retries=35, started=38363 ms ago, cancelled=false, msg=row 'huditest,12345678,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ip-10-16-254-233.ec2.internal,16020,1600298383776, seqNum=0
20/09/18 21:35:53 TRACE MetaCache: Removed region=hbase:meta,,1.1588230740, hostname=ip-10-16-254-233.ec2.internal,16020,1600298383776, seqNum=0 from cache
{code}

Even after adding the HBase jars to the session it continues to hang. I was able to resolve the hanging issue by building the hudi-spark-bundle jar without shading the HBase-related dependencies and adding them explicitly when launching my spark-shell, so it appears to be a problem with the relocation.
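Since the conclusion above points at the shading relocation, one quick sanity check is to list the bundle jar's entries and confirm where ClusterStatusListener landed. A minimal sketch (the stand-in jar built here is purely illustrative, to keep the example self-contained; in practice, set BUNDLE to the real hudi-spark-bundle jar):

```shell
# Sketch: check where the HBase classes ended up after shading. For
# illustration this fabricates a stand-in jar containing the relocated
# path; against a real build, point BUNDLE at the actual bundle instead.
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/org/apache/hudi/org/apache/hadoop/hbase/client"
touch "$tmp/org/apache/hudi/org/apache/hadoop/hbase/client/ClusterStatusListener.class"
(cd "$tmp" && python3 -m zipfile -c bundle.jar org)
BUNDLE="$tmp/bundle.jar"
# Relocated classes appear under org/apache/hudi/...; if a config value
# still names the unrelocated class, it cannot be loaded from the bundle.
python3 -m zipfile -l "$BUNDLE" | grep 'org/apache/hudi/org/apache/hadoop/hbase/client/ClusterStatusListener'
```

If the listener only exists under the `org.apache.hudi.` prefix while the configuration default still names `org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener`, the class-cast failure in issue 1 follows directly.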
Example of using the HBase index successfully:

{code:java}
spark-shell --jars /usr/lib/hudi/cli/lib/hbase-client-1.2.3.jar,/usr/lib/hudi/cli/lib/hbase-common-1.2.3.jar,/usr/lib/hudi/cli/lib/hbase-protocol-1.2.3.jar,/usr/lib/hudi/cli/lib/htrace-core-3.1.0-incubating.jar,/usr/lib/hudi/cli/lib/metrics-core-2.2.0.jar,hudi-spark-bundle_2.11-0.6.0-amzn-0.jar,/usr/lib/spark/external/lib/spark-avro.jar --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false"
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)