Hi Saad
In your initial mail you mentioned that there are lots
of checkAndPut ops but on different rows. The failure in obtaining
locks (write lock as it is checkAndPut) means there is contention on
the same row key. If that is the case, then yes, that is the first step
before BC reads and
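The same-row contention described above can be illustrated with a toy model — this is not HBase's implementation, just a sketch of the per-row write-lock semantics behind checkAndPut, with made-up row keys and timeouts:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Toy model of per-row write locks (NOT HBase's implementation): each
// checkAndPut must take an exclusive lock on its row, so writers to
// distinct rows never block each other, while writers to the same row
// contend, and a bounded wait surfaces as "failed to obtain row lock".
public class RowLockDemo {
    private final Map<String, ReentrantLock> rowLocks = new ConcurrentHashMap<>();

    // Returns true if the row lock was obtained (and the edit "applied") in time.
    public boolean checkAndPut(String rowKey, long timeoutMs) {
        ReentrantLock lock = rowLocks.computeIfAbsent(rowKey, k -> new ReentrantLock());
        boolean acquired = false;
        try {
            acquired = lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (!acquired) {
            return false; // analogous to the row-lock failures in the stack traces
        }
        try {
            // ...check the current cell value, then apply the Put atomically...
            return true;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        RowLockDemo region = new RowLockDemo();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            ReentrantLock rowA = region.rowLocks.computeIfAbsent("row-a", k -> new ReentrantLock());
            rowA.lock();
            held.countDown();
            try { release.await(); } catch (InterruptedException e) { }
            rowA.unlock();
        });
        holder.start();
        held.await(); // another writer now owns row-a's lock
        System.out.println(region.checkAndPut("row-a", 50)); // false: same-row contention
        System.out.println(region.checkAndPut("row-b", 50)); // true: different row, no contention
        release.countDown();
        holder.join();
    }
}
```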
The question remains, though, of why it is even accessing a column family's
files that should be excluded based on the Scan. And that column family
does NOT specify prefetch on open in its schema. Only the one we want to
read specifies prefetch on open, which we want to override if possible for
the
See below for more I found on item 3.
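(For reference, prefetch-on-open is a per-column-family schema attribute, so it should be controllable family by family; a sketch in the HBase shell, with made-up table and family names:

```
# hbase shell -- table and family names are placeholders
alter 'mytable', {NAME => 'cf_wanted', PREFETCH_BLOCKS_ON_OPEN => 'true'}
alter 'mytable', {NAME => 'cf_other',  PREFETCH_BLOCKS_ON_OPEN => 'false'}
```

and restricting the Scan to one family via Scan#addFamily is what should keep the other family's files out of the read path in the first place.)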
Cheers.
Saad
On Sat, Mar 10, 2018 at 7:17 PM, Saad Mufti wrote:
> Hi,
>
> I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There is
> no HBase installed on the cluster, only HBase libs linked to my Spark app.
> We
Although now that I think about this a bit more, all the failures we saw
were failures to obtain a row lock, and in the thread stack traces we always
saw the thread somewhere inside getRowLockInternal or similar. We never saw
any contention on the bucket cache lock.
Cheers.
Saad
On Sat,
Are you using the AuthUtil class to reauthenticate? This class is in HBase, and
uses the Hadoop class UserGroupInformation to do the actual login and
re-login. But, if your UserGroupInformation class is from Hadoop 2.5.1 or
earlier, it has a bug if you are using Java 8, as most of us are. The
relogin
Also, for now we have mitigated this problem by using the new setting in
HBase 1.4.0 that prevents one slow region server from blocking all client
requests. Of course it causes some timeouts but our overall ecosystem
contains Kafka queues for retries, so we can live with that. From what I
can see,
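(For reference, this is presumably the client-side knob added by HBASE-16388, which shipped in 1.4.0; a sketch of the hbase-site.xml entry, with an illustrative threshold value:

```
<property>
  <name>hbase.client.perserver.requests.threshold</name>
  <!-- illustrative cap; the default is Integer.MAX_VALUE, i.e. unlimited -->
  <value>2048</value>
</property>
```

As I understand it, once a single region server has that many outstanding requests from the client, further requests to that server fail fast instead of tying up all client threads.)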
Hi,
I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There is
no Hbase installed on the cluster, only HBase libs linked to my Spark app.
We are reading the snapshot info from an HBase folder in S3 using
TableSnapshotInputFormat class from HBase 1.4.0 to have the Spark job read
So if I understand correctly, we would mitigate the problem by not evicting
blocks for archived files immediately? Wouldn't this potentially lead to
problems later if the LRU algo chooses to evict blocks for active files and
leave blocks for archived files in there?
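A minimal sketch of that concern — a toy recency-only cache, not HBase's actual LruBlockCache, with made-up file and block names: if archived files' blocks are left in place and one of them is touched, plain LRU will happily evict an active file's block instead:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy recency-only block cache (NOT HBase's LruBlockCache): eviction
// looks only at access order, so it cannot tell blocks of archived
// files apart from blocks of active files.
public class LruBlockCacheDemo {
    public static LinkedHashMap<String, byte[]> newCache(int capacity) {
        return new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > capacity; // evict least-recently-used when full
            }
        };
    }

    public static void main(String[] args) {
        LinkedHashMap<String, byte[]> cache = newCache(2);
        cache.put("active-file/block-1", new byte[0]);
        cache.put("archived-file/block-7", new byte[0]);
        cache.get("archived-file/block-7");             // archived block is touched
        cache.put("active-file/block-2", new byte[0]);  // evicts active-file/block-1
        System.out.println(cache.keySet());
        // [archived-file/block-7, active-file/block-2]
    }
}
```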
I would definitely love to