Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-25 Thread Josh Elser

Coming full circle on the "makes me worry" comment I left:

I asked the question in work channels about my concern and SteveL did 
confirm that the "S3 strong consistency" feature does apply generally to 
CRUD operations.


I believe this means, if we assume there is exactly one RegionServer 
which is hosting a Region at one time, that one RegionServer is capable 
of ensuring that the gaps which do exist in S3 are a non-issue (without 
the need for an HBOSS-like solution).


Taking the suggestion of a file-per-store which enumerates the committed 
files: the RegionServer can make sure that operations which concurrently 
want to update that file are exclusive, e.g. a bulk load, a memstore 
flush, a compaction commit.
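
A minimal sketch of that exclusivity, with illustrative names only (not the
actual HBASE-24749 classes): the RegionServer funnels every manifest-changing
commit through one lock, so a flush, a compaction commit and a bulk load never
rewrite the per-store file concurrently. It assumes the strongly consistent
overwrite discussed above.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoreFileManifest {
  private final FileSystem fs;
  private final Path manifestFile;           // e.g. <storeDir>/hfile.list (illustrative)
  private final ReentrantLock commitLock = new ReentrantLock();
  private final List<String> committedFiles = new ArrayList<>();

  public StoreFileManifest(FileSystem fs, Path manifestFile) {
    this.fs = fs;
    this.manifestFile = manifestFile;
  }

  /** Atomically apply one commit (memstore flush, compaction, or bulk load). */
  public void commit(Collection<String> added, Collection<String> removed)
      throws IOException {
    commitLock.lock();                       // only one committer at a time
    try {
      List<String> next = new ArrayList<>(committedFiles);
      next.removeAll(removed);
      next.addAll(added);
      // Overwrite the small per-store manifest in one shot.
      try (FSDataOutputStream out = fs.create(manifestFile, true)) {
        for (String f : next) {
          out.write((f + "\n").getBytes(StandardCharsets.UTF_8));
        }
      }
      committedFiles.clear();
      committedFiles.addAll(next);
    } finally {
      commitLock.unlock();
    }
  }
}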


On my plate today is to incorporate this into a design doc specifically 
for storefile metadata (from the other message in this broader thread)


On 5/24/21 1:39 PM, Josh Elser wrote:
I got pulled into a call with some folks from S3 at the last minute late 
last week.


There was a comment made in passing about reading the latest, written 
version of a file. At the moment, I didn't want to digress into that 
because of immutable HFiles. However, if we're tracking files-per-store 
in a file, that makes me worry.


To the nice digging both Duo and Andrew have shared here already and 
Nick's point about design, I definitely think stating what we expect and 
mapping that to the "platforms" which provide that "today" (as we know 
each will change) is the only way to insulate ourselves. The Hadoop FS 
contract tests are also a great thing we can adopt.


On 5/21/21 9:53 PM, 张铎(Duo Zhang) wrote:

So maybe we could introduce a .hfilelist directory, and put the hfilelist
files under this directory, so we do not need to list all the files under
the region directory.

And considering the possible implementation for typical object storages,
listing the last directory on the whole path will be less expensive.
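
A rough sketch of how a reader would use such a layout, assuming an
illustrative <storeDir>/.hfilelist path (not a committed layout): only the
small leaf directory is listed, never the whole store or region directory.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HFileListDir {
  /** Returns the listing of <storeDir>/.hfilelist only. */
  public static FileStatus[] listManifests(Configuration conf, Path storeDir)
      throws IOException {
    Path hfileListDir = new Path(storeDir, ".hfilelist");
    FileSystem fs = hfileListDir.getFileSystem(conf);
    // On an object store this is a single prefix listing of a tiny
    // directory, much cheaper than listing every hfile in the store.
    return fs.listStatus(hfileListDir);
  }
}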

Andrew Purtell wrote on Sat, May 22, 2021 at 9:35 AM:





On May 21, 2021, at 6:07 PM, 张铎  wrote:

Since we just make use of the general FileSystem API to do listing, is it
possible to make use of 'bucket index listing'?


Yes, those words mean the same thing.



Andrew Purtell wrote on Sat, May 22, 2021 at 6:34 AM:






On May 20, 2021, at 4:00 AM, Wellington Chevreuil <wellington.chevre...@gmail.com> wrote:






IMO it should be a file per store.
Per region is not suitable here as compaction is per store.
Per file means we still need to list all the files. And usually, after
compaction, we need to do an atomic operation to remove several old files
and add a new file, or even several files for stripe compaction. It will be
easy if we just write one file to commit these changes.


Fine for me if it's simpler. Mentioned the per-file approach because I
thought it could be easier/faster to do that, rather than having to update
the store file list on every flush. AFAIK, append is off the table, so
updating this file would mean reading it, writing the original content plus
the new hfile to a temp file, deleting the original file, and renaming it.
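
A sketch of that read/rewrite/rename cycle against the Hadoop FileSystem API,
with illustrative file names ('hfile.list', 'hfile.list.tmp'); this is not the
committed implementation.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoreFileListRewrite {
  public static void addHFile(FileSystem fs, Path storeDir, String newHFile)
      throws IOException {
    Path list = new Path(storeDir, "hfile.list");
    Path tmp = new Path(storeDir, "hfile.list.tmp");

    // 1. Read the current committed list (may be absent on a fresh store).
    List<String> files = new ArrayList<>();
    if (fs.exists(list)) {
      try (BufferedReader r = new BufferedReader(
          new InputStreamReader(fs.open(list), StandardCharsets.UTF_8))) {
        String line;
        while ((line = r.readLine()) != null) {
          files.add(line);
        }
      }
    }
    files.add(newHFile);

    // 2. Write the original content plus the new hfile to a temp file.
    try (FSDataOutputStream out = fs.create(tmp, true)) {
      for (String f : files) {
        out.write((f + "\n").getBytes(StandardCharsets.UTF_8));
      }
    }

    // 3. Delete the original and rename the temp file into place.
    fs.delete(list, false);
    fs.rename(tmp, list);
  }
}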



That sounds right to me.

A minor potential optimization is the filename could have a timestamp
component, so a bucket index listing at that path would pick up a list
including the latest, and the latest would be used as the manifest of valid
store files. The cloud object store is expected to provide an atomic
listing semantic where the file is written and closed and only then is it
visible, and it is visible at once to everyone. (I think this is available
on most.) Old manifest file versions could be lazily deleted.
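
A sketch of that timestamped-manifest variant, with an illustrative naming
scheme: each commit writes a brand-new manifest, readers list the directory
once and take the newest, and no rename is needed.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TimestampedManifest {
  /** Write a new manifest; never overwrites or renames an existing one. */
  public static Path write(FileSystem fs, Path manifestDir, List<String> hfiles)
      throws IOException {
    Path manifest =
        new Path(manifestDir, "manifest." + System.currentTimeMillis());
    try (FSDataOutputStream out = fs.create(manifest, false)) {
      for (String f : hfiles) {
        out.write((f + "\n").getBytes(StandardCharsets.UTF_8));
      }
    }
    return manifest;
  }

  /** Pick the latest manifest from a single directory listing. */
  public static Path latest(FileSystem fs, Path manifestDir) throws IOException {
    Path newest = null;
    long best = -1L;
    for (FileStatus st : fs.listStatus(manifestDir)) {
      String name = st.getPath().getName();
      if (!name.startsWith("manifest.")) {
        continue;
      }
      long ts = Long.parseLong(name.substring("manifest.".length()));
      if (ts > best) {
        best = ts;
        newest = st.getPath();
      }
    }
    return newest; // old manifest versions can be lazily deleted later
  }
}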



On Thu, May 20, 2021 at 02:57, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

IIRC S3 is the only object storage which does not guarantee
read-after-write consistency in the past...

This is the quick result after googling

AWS [1]

Amazon S3 delivers strong read-after-write consistency automatically for
all applications


Azure[2]

Azure Storage was designed to embrace a strong consistency model that
guarantees that after the service performs an insert or update operation,
subsequent read operations return the latest update.


Aliyun[3]

A feature requires that object operations in OSS be atomic, which
indicates that operations can only either succeed or fail without
intermediate states. To ensure that users can access only complete data,
OSS does not return corrupted or partial data.

Object operations in OSS are highly consistent. For example, when a user
receives an upload (PUT) success response, the uploaded object can be read
immediately, and copies of the object are written to multiple devices for
redundancy. Therefore, the situations where data is not obtained when you
perform the read-after-write operation do not exist. The same is true for
delete operations. After you delete an object, the object and its copies
no longer exist.


GCP[4]

Cloud Storage provides strong global consistency for the following
operations, 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-24 Thread Duo Zhang
Just go ahead Josh, I haven't started to write the design doc yet.

Thank you for your help!

Josh Elser wrote on Tue, May 25, 2021 at 1:45 AM:

> Without completely opening Pandora's box, I will say we definitely have
> multiple ways we can solve the metadata management for tracking (e.g. in
> meta, in some other system table, in some other system, in a per-store
> file). Each of them have pro's and con's, and each of them has "favor"
> as to what pain we've most recently felt as a project.
>
> I don't want to defer having the discussion on what the "correct" one
> should be, but I do want to point out that it's only half of the problem
> of storefile tracking.
>
> My hope is that we can make this tracking system be pluggable, such that
> we can prototype a solution that works "good enough" for now and enables
> the rest of the development work to keep moving forward.
>
> I'm happy to see so many other folks also interested in the design of
> how we store this.
>
> Could I suggest we move this discussion around the metadata storage into
> its own thread? If Duo doesn't already have a design doc started, I can
> also try to put one together this week.
>
> Does that work for you all?
>
> On 5/22/21 11:02 AM, 张铎(Duo Zhang) wrote:
> > I could put up a simple design doc for this.
> >
> > But there is still a problem, about how to do rolling upgrading.
> >
> > After we changed the behavior, the region server will write partial store
> > files directly into the data directory. For new region servers, this is
> not
> > a problem, as we will read the hfilelist file to find out the valid store
> > files.
> > But when rolling upgrading, we can not upgrade all the regionservers at
> > once, for old regionservers, they will initialize a store by listing the
> > store files, so if a new regionserver crashes when compacting and its
> > regions are assigned to old regionservers, the old regionservers will be
> in
> > trouble...
> >
> > Stack wrote on Sat, May 22, 2021 at 12:14 PM:
> >
> >> HBASE-24749 design and implementation had acknowledged compromises on
> >> review: e.g. adding a new 'system table' to hold store files.  I'd
> suggest
> >> the design and implementation need a revisit before we go forward; for
> >> instance, factoring for systems other than s3 as suggested above (I like
> >> the Duo list).
> >>
> >> S
> >>
> >> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
> >> wrote:
> >>
> >>> What about just storing the hfile list in a file? Since now S3 has
> strong
> >>> consistency, we could safely overwrite a file then I think?
> >>>
> >>> And since the hfile list file will be very small, renaming will not be
> a
> >>> big problem.
> >>>
> >>> We could write the hfile list to a file called 'hfile.list.tmp', and
> then
> >>> rename it to 'hfile.list'.
> >>>
> >>> This is safe for HDFS, and for S3, since it is not atomic, maybe we
> could
> >>> face that, the 'hfile.list' file is not there, but there is a
> >>> 'hfile.list.tmp'.
> >>>
> >>> So when opening a HStore, we first check if 'hfile.list' is there, if
> >> not,
> >>> try 'hfile.list.tmp', rename it and load it. For safety, we could write
> >> an
> >>> initial hfile list file with no hfiles. So if we can not load either
> >>> 'hfile.list' or 'hfile.list.tmp', then we know something is wrong so
> >> users
> >>> should try to fix  it with HBCK.
> >>> And in HBCK, we will do a listing and generate the 'hfile.list' file.
> >>>
> >>> WDYT?
> >>>
> >>> Thanks.
> >>>
> >>> Wellington Chevreuil wrote on Wed, May 19, 2021 at 10:43 PM:
> >>>
>  Thank you, Andrew and Duo,
> 
>  Talking internally with Josh Elser, initial idea was to rebase the
> >>> feature
>  branch with master (in order to catch with latest commits), then focus
> >> on
>  work to have a minimal functioning hbase, in other words, together
> with
> >>> the
>  already committed work from HBASE-25391, make sure flush, compactions,
>  splits and merges all can take advantage of the persistent store file
>  manager and complete with no need to rely on renames. These all map to
> >>> the
>  substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test
> >>> and
>  validate this works well for our goals, we can then focus on
> snapshots,
>  bulkloading and tooling.
> 
>  S3 now supports strong consistency, and I heard that they are also
> > implementing atomic renaming currently, so maybe that's one of the
>  reasons
> > why the development is silent now..
> >
>  Interesting, I had no idea this was being implemented. I know,
> >> however, a
>  version of this feature is already available on latest EMR releases
> (at
>  least from 6.2.0), and AWS team has published their own blog post with
>  their results:
> 
> 
> >>>
> >>
> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
> 
>  But I do not think store hfile list in meta is the only solution. It
> 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-24 Thread Duo Zhang
Oh, sorry. Missed that.

I think the key point here is that we should not have partial storefiles in the
data directory if we want to downgrade. This is possible by first setting the
flag to false to prevent new partial storefiles, and then using an HBCK
command to remove all the existing partial storefiles?
In general, I think we should have a way to clean up broken storefiles
automatically when the new layout is in use, so maybe we could also rely on
that to remove them.
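
A minimal sketch of such a cleaner, assuming the committed set comes from the
hfile list (all names illustrative): anything in the store directory that the
manifest does not reference is treated as a broken/partial store file, whether
the caller is an HBCK-style command or a background chore.

import java.io.IOException;
import java.util.Set;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BrokenStoreFileCleaner {
  public static void clean(FileSystem fs, Path storeDir, Set<String> committed)
      throws IOException {
    for (FileStatus st : fs.listStatus(storeDir)) {
      String name = st.getPath().getName();
      // Skip the manifest file(s) themselves; delete anything else that the
      // committed list does not reference (a leftover from a crashed
      // flush/compaction under the new layout).
      if (st.isFile() && !committed.contains(name)
          && !name.startsWith("hfile.list")) {
        fs.delete(st.getPath(), false);
      }
    }
  }
}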

Thanks.

Andrew Purtell wrote on Tue, May 25, 2021 at 12:24 AM:

> > And for downgrading, usually we do not support downgrading from a major
> version upgrading, so it is not a big problem.
>
> You missed an earlier comment from me.
>
> Our team requires this to be released in a branch-2 version or we can't use
> it. Therefore I am not in favor of any solution that requires a major
> version increment.
>
>
> On Sun, May 23, 2021 at 5:43 AM 张铎(Duo Zhang) 
> wrote:
>
> > I do not think it should be a table level config. It should be a cluster
> > level config. We only have one FileSystem so it is useless to let
> different
> > tables have different ways to store hfile list.
> >
> > But I think the general approach is fine. We could introduce a config for
> > whether to enable 'write to data directory directly' mode. When rolling
> > upgrading, the flag should be false, you can change it to true after the
> > whole cluster has been upgraded.
> >
> > And for downgrading, usually we do not support downgrading from a major
> > version upgrading, so it is not a big problem.
> >
> > Thanks.
> >
> Andrew Purtell wrote on Sun, May 23, 2021 at 12:53 AM:
> >
> > > Put a check in the code whether hfilelist mode or original store layout
> > is
> > > in use and handles both cases. Then, to upgrade:
> > >
> > > 1. First, perform a rolling upgrade to $NEW_VERSION .
> > >
> > > 2. Once upgraded to $NEW_VERSION execute an alter table command that
> > > enables hfilelist mode. This will cause all regions to close and reopen
> > in
> > > the new mode.
> > >
> > > Because the rolling upgrade to $NEW_VERSION is completed first a mix of
> > > old and new layouts is fine, for the brief period of time when store
> > > layouts are upgrading in response to the alter command, because this
> > > version can handle both.
> > >
> > > Downgrade to an older version is not possible after the alter table
> > > command, so this must be clearly documented, but of course would not
> be a
> > > surprise to anyone, because the alter command is for switching to the
> new
> > > store layout.
> > >
> > >
> > > > On May 22, 2021, at 8:03 AM, 张铎  wrote:
> > > >
> > > > I could put up a simple design doc for this.
> > > >
> > > > But there is still a problem, about how to do rolling upgrading.
> > > >
> > > > After we changed the behavior, the region server will write partial
> > store
> > > > files directly into the data directory. For new region servers, this
> is
> > > not
> > > > a problem, as we will read the hfilelist file to find out the valid
> > store
> > > > files.
> > > > But when rolling upgrading, we can not upgrade all the regionservers
> at
> > > > once, for old regionservers, they will initialize a store by listing
> > the
> > > > store files, so if a new regionserver crashes when compacting and its
> > > > regions are assigned to old regionservers, the old regionservers will
> > be
> > > in
> > > > trouble...
> > > >
> > > > Stack wrote on Sat, May 22, 2021 at 12:14 PM:
> > > >
> > > >> HBASE-24749 design and implementation had acknowledged compromises
> on
> > > >> review: e.g. adding a new 'system table' to hold store files.  I'd
> > > suggest
> > > >> the design and implementation need a revisit before we go forward;
> for
> > > >> instance, factoring for systems other than s3 as suggested above (I
> > like
> > > >> the Duo list).
> > > >>
> > > >> S
> > > >>
> > > >>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <
> palomino...@gmail.com
> > >
> > > >>> wrote:
> > > >>>
> > > >>> What about just storing the hfile list in a file? Since now S3 has
> > > strong
> > > >>> consistency, we could safely overwrite a file then I think?
> > > >>>
> > > >>> And since the hfile list file will be very small, renaming will not
> > be
> > > a
> > > >>> big problem.
> > > >>>
> > > >>> We could write the hfile list to a file called 'hfile.list.tmp',
> and
> > > then
> > > >>> rename it to 'hfile.list'.
> > > >>>
> > > >>> This is safe for HDFS, and for S3, since it is not atomic, maybe we
> > > could
> > > >>> face that, the 'hfile.list' file is not there, but there is a
> > > >>> 'hfile.list.tmp'.
> > > >>>
> > > >>> So when opening a HStore, we first check if 'hfile.list' is there,
> if
> > > >> not,
> > > >>> try 'hfile.list.tmp', rename it and load it. For safety, we could
> > write
> > > >> an
> > > >>> initial hfile list file with no hfiles. So if we can not load
> either
> > > >>> 'hfile.list' or 'hfile.list.tmp', then we know something is wrong
> so
> > > >> users
> > > >>> should try to fix  it with 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-24 Thread Josh Elser
Without completely opening Pandora's box, I will say we definitely have 
multiple ways we can solve the metadata management for tracking (e.g. in 
meta, in some other system table, in some other system, in a per-store 
file). Each of them has pros and cons, and each of them finds "favor" 
according to what pain we've most recently felt as a project.


I don't want to defer having the discussion on what the "correct" one 
should be, but I do want to point out that it's only half of the problem 
of storefile tracking.


My hope is that we can make this tracking system be pluggable, such that 
we can prototype a solution that works "good enough" for now and enables 
the rest of the development work to keep moving forward.
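
Purely as an illustration of the pluggability idea (this is not the interface
that ships with HBASE-24749), the tracker could be as small as the following,
letting a file-based, meta-based, or external-system implementation be swapped
in by configuration:

import java.io.IOException;
import java.util.Collection;
import java.util.List;

public interface StoreFileTracker {
  /** The committed store files for one store. */
  List<String> load() throws IOException;

  /** Record a flush or bulk load result. */
  void add(Collection<String> newFiles) throws IOException;

  /** Atomically replace compacted-away files with the compaction output. */
  void replace(Collection<String> compactedFiles, Collection<String> newFiles)
      throws IOException;
}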


I'm happy to see so many other folks also interested in the design of 
how we store this.


Could I suggest we move this discussion around the metadata storage into 
its own thread? If Duo doesn't already have a design doc started, I can 
also try to put one together this week.


Does that work for you all?

On 5/22/21 11:02 AM, 张铎(Duo Zhang) wrote:

I could put up a simple design doc for this.

But there is still a problem, about how to do rolling upgrading.

After we changed the behavior, the region server will write partial store
files directly into the data directory. For new region servers, this is not
a problem, as we will read the hfilelist file to find out the valid store
files.
But when rolling upgrading, we can not upgrade all the regionservers at
once, for old regionservers, they will initialize a store by listing the
store files, so if a new regionserver crashes when compacting and its
regions are assigned to old regionservers, the old regionservers will be in
trouble...

Stack wrote on Sat, May 22, 2021 at 12:14 PM:


HBASE-24749 design and implementation had acknowledged compromises on
review: e.g. adding a new 'system table' to hold store files.  I'd suggest
the design and implementation need a revisit before we go forward; for
instance, factoring for systems other than s3 as suggested above (I like
the Duo list).

S

On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
wrote:


What about just storing the hfile list in a file? Since now S3 has strong
consistency, we could safely overwrite a file then I think?

And since the hfile list file will be very small, renaming will not be a
big problem.

We could write the hfile list to a file called 'hfile.list.tmp', and then
rename it to 'hfile.list'.

This is safe for HDFS, and for S3, since it is not atomic, maybe we could
face that, the 'hfile.list' file is not there, but there is a
'hfile.list.tmp'.

So when opening a HStore, we first check if 'hfile.list' is there, if not,
try 'hfile.list.tmp', rename it and load it. For safety, we could write an
initial hfile list file with no hfiles. So if we can not load either
'hfile.list' or 'hfile.list.tmp', then we know something is wrong so users
should try to fix it with HBCK.
And in HBCK, we will do a listing and generate the 'hfile.list' file.

WDYT?

Thanks.
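
A sketch of that open-time recovery, using the proposed
'hfile.list'/'hfile.list.tmp' names (illustrative, not committed code):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HFileListLoader {
  public static Path resolve(FileSystem fs, Path storeDir) throws IOException {
    Path list = new Path(storeDir, "hfile.list");
    Path tmp = new Path(storeDir, "hfile.list.tmp");
    if (fs.exists(list)) {
      return list;
    }
    if (fs.exists(tmp)) {
      // The previous writer finished the temp file but crashed before the
      // rename; complete the rename now and use the result.
      fs.rename(tmp, list);
      return list;
    }
    // An initial (possibly empty) list is always written when the store is
    // created, so reaching this point means something is wrong.
    throw new IOException("No hfile list found under " + storeDir
        + "; run HBCK to regenerate it from a directory listing");
  }
}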

Wellington Chevreuil wrote on Wed, May 19, 2021 at 10:43 PM:


Thank you, Andrew and Duo,

Talking internally with Josh Elser, the initial idea was to rebase the
feature branch with master (in order to catch up with the latest commits),
then focus on work to have a minimally functioning hbase; in other words,
together with the already committed work from HBASE-25391, make sure flush,
compactions, splits and merges can all take advantage of the persistent
store file manager and complete with no need to rely on renames. These all
map to the subtasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could
test and validate this works well for our goals, we can then focus on
snapshots, bulkloading and tooling.

S3 now supports strong consistency, and I heard that they are also
implementing atomic renaming currently, so maybe that's one of the
reasons why the development is silent now..


Interesting, I had no idea this was being implemented. I know, however, a
version of this feature is already available on the latest EMR releases (at
least from 6.2.0), and the AWS team has published their own blog post with
their results:


https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/


But I do not think store hfile list in meta is the only solution. It will
cause cyclic dependencies for hbase:meta, and then force us to have a
fallback solution which makes the code a bit ugly. We should try to see if
this could be done with only the FileSystem.


This is indeed a relevant concern. One idea I had mentioned in the original
design doc was to track committed/non-committed files through xattr (or
tags), which may have its own performance issues as explained by Stephen
Wu, but is something that could be attempted.

On Wed, May 19, 2021 at 04:56, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:


S3 now supports strong consistency, and I heard that they are also
implementing 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-24 Thread Josh Elser
I got pulled into a call with some folks from S3 at the last minute late 
last week.


There was a comment made in passing about reading the latest, written 
version of a file. At the moment, I didn't want to digress into that 
because of immutable HFiles. However, if we're tracking files-per-store 
in a file, that makes me worry.


To the nice digging both Duo and Andrew have shared here already and 
Nick's point about design, I definitely think stating what we expect and 
mapping that to the "platforms" which provide that "today" (as we know 
each will change) is the only way to insulate ourselves. The Hadoop FS 
contract tests are also a great thing we can adopt.
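
In that spirit, a hand-rolled check (not the actual Hadoop contract-test
framework) that states one expectation we rely on, i.e. a file written and
closed is immediately visible and readable with the content we wrote:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadAfterWriteCheck {
  public static void check(FileSystem fs, Path dir) throws IOException {
    Path p = new Path(dir, "consistency-probe");
    byte[] expected = "probe".getBytes(StandardCharsets.UTF_8);
    try (FSDataOutputStream out = fs.create(p, true)) {
      out.write(expected);
    }
    // Expectation: visible and readable immediately after close.
    if (!fs.exists(p)) {
      throw new IOException("File not visible after close: " + p);
    }
    byte[] actual = new byte[expected.length];
    try (FSDataInputStream in = fs.open(p)) {
      in.readFully(actual);
    }
    if (!Arrays.equals(expected, actual)) {
      throw new IOException("Stale read after write for " + p);
    }
    fs.delete(p, false);
  }
}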


On 5/21/21 9:53 PM, 张铎(Duo Zhang) wrote:

So maybe we could introduce a .hfilelist directory, and put the hfilelist
files under this directory, so we do not need to list all the files under
the region directory.

And considering the possible implementation for typical object storages,
listing the last directory on the whole path will be less expensive.

Andrew Purtell wrote on Sat, May 22, 2021 at 9:35 AM:




On May 21, 2021, at 6:07 PM, 张铎  wrote:

Since we just make use of the general FileSystem API to do listing, is it
possible to make use of 'bucket index listing'?


Yes, those words mean the same thing.



Andrew Purtell wrote on Sat, May 22, 2021 at 6:34 AM:





On May 20, 2021, at 4:00 AM, Wellington Chevreuil <wellington.chevre...@gmail.com> wrote:






IMO it should be a file per store.
Per region is not suitable here as compaction is per store.
Per file means we still need to list all the files. And usually, after
compaction, we need to do an atomic operation to remove several old files
and add a new file, or even several files for stripe compaction. It will be
easy if we just write one file to commit these changes.


Fine for me if it's simpler. Mentioned the per-file approach because I
thought it could be easier/faster to do that, rather than having to update
the store file list on every flush. AFAIK, append is off the table, so
updating this file would mean reading it, writing the original content plus
the new hfile to a temp file, deleting the original file, and renaming it.



That sounds right to me.

A minor potential optimization is the filename could have a timestamp
component, so a bucket index listing at that path would pick up a list
including the latest, and the latest would be used as the manifest of valid
store files. The cloud object store is expected to provide an atomic
listing semantic where the file is written and closed and only then is it
visible, and it is visible at once to everyone. (I think this is available
on most.) Old manifest file versions could be lazily deleted.



On Thu, May 20, 2021 at 02:57, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

IIRC S3 is the only object storage which does not guarantee
read-after-write consistency in the past...

This is the quick result after googling

AWS [1]


Amazon S3 delivers strong read-after-write consistency automatically for
all applications


Azure[2]

Azure Storage was designed to embrace a strong consistency model that
guarantees that after the service performs an insert or update operation,
subsequent read operations return the latest update.


Aliyun[3]

A feature requires that object operations in OSS be atomic, which
indicates that operations can only either succeed or fail without
intermediate states. To ensure that users can access only complete data,
OSS does not return corrupted or partial data.

Object operations in OSS are highly consistent. For example, when a user
receives an upload (PUT) success response, the uploaded object can be read
immediately, and copies of the object are written to multiple devices for
redundancy. Therefore, the situations where data is not obtained when you
perform the read-after-write operation do not exist. The same is true for
delete operations. After you delete an object, the object and its copies
no longer exist.



GCP[4]


Cloud Storage provides strong global consistency for the following
operations, including both data and metadata:

Read-after-write
Read-after-metadata-update
Read-after-delete
Bucket listing
Object listing



I think these vendors could cover most end users in the world?

1. https://aws.amazon.com/cn/s3/consistency/
2.





https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet

3. https://www.alibabacloud.com/help/doc-detail/31827.htm
4. https://cloud.google.com/storage/docs/consistency

Nick Dimiduk wrote on Wed, May 19, 2021 at 11:40 PM:


On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang)  wrote:


What about just storing the hfile list in a file? Since now S3 has strong
consistency, we could safely overwrite a file then I think?



My concern is about portability. S3 isn't the only blob store in town, and
consistent read-what-you-wrote semantics are not a standard feature, as far
as I know. If we want something that can work on 3 or 5 major public cloud


Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-24 Thread Andrew Purtell
The important detail is first there is an upgrade to a version that can
support the new store layout across the whole cluster, so there will be no
rolling upgrade related issues when the new layout is enabled.

The new layout can be enabled with a new site config, a shell command to
set a schema feature, whatever, that part can be hashed out in the design
discussion. A cluster level configuration is fine.
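
For example, a cluster-level switch could be as simple as a boolean site
property; the key below is hypothetical, not an existing HBase configuration
name, and is only meant to show the shape of the configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class StoreLayoutConfig {
  /** Hypothetical key; keep it false until every server runs the new version. */
  public static final String HFILE_LIST_ENABLED_KEY =
      "hbase.store.hfilelist.enabled";

  public static boolean isHFileListLayoutEnabled(Configuration conf) {
    return conf.getBoolean(HFILE_LIST_ENABLED_KEY, false);
  }

  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    System.out.println("hfilelist layout enabled: "
        + isHFileListLayoutEnabled(conf));
  }
}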



On Sun, May 23, 2021 at 5:43 AM 张铎(Duo Zhang)  wrote:

> I do not think it should be a table level config. It should be a cluster
> level config. We only have one FileSystem so it is useless to let different
> tables have different ways to store hfile list.
>
> But I think the general approach is fine. We could introduce a config for
> whether to enable 'write to data directory directly' mode. When rolling
> upgrading, the flag should be false, you can change it to true after the
> whole cluster has been upgraded.
>
> And for downgrading, usually we do not support downgrading from a major
> version upgrading, so it is not a big problem.
>
> Thanks.
>
> Andrew Purtell wrote on Sun, May 23, 2021 at 12:53 AM:
>
> > Put a check in the code whether hfilelist mode or original store layout
> is
> > in use and handles both cases. Then, to upgrade:
> >
> > 1. First, perform a rolling upgrade to $NEW_VERSION .
> >
> > 2. Once upgraded to $NEW_VERSION execute an alter table command that
> > enables hfilelist mode. This will cause all regions to close and reopen
> in
> > the new mode.
> >
> > Because the rolling upgrade to $NEW_VERSION is completed first a mix of
> > old and new layouts is fine, for the brief period of time when store
> > layouts are upgrading in response to the alter command, because this
> > version can handle both.
> >
> > Downgrade to an older version is not possible after the alter table
> > command, so this must be clearly documented, but of course would not be a
> > surprise to anyone, because the alter command is for switching to the new
> > store layout.
> >
> >
> > > On May 22, 2021, at 8:03 AM, 张铎  wrote:
> > >
> > > I could put up a simple design doc for this.
> > >
> > > But there is still a problem, about how to do rolling upgrading.
> > >
> > > After we changed the behavior, the region server will write partial
> store
> > > files directly into the data directory. For new region servers, this is
> > not
> > > a problem, as we will read the hfilelist file to find out the valid
> store
> > > files.
> > > But when rolling upgrading, we can not upgrade all the regionservers at
> > > once, for old regionservers, they will initialize a store by listing
> the
> > > store files, so if a new regionserver crashes when compacting and its
> > > regions are assigned to old regionservers, the old regionservers will
> be
> > in
> > > trouble...
> > >
> > > Stack wrote on Sat, May 22, 2021 at 12:14 PM:
> > >
> > >> HBASE-24749 design and implementation had acknowledged compromises on
> > >> review: e.g. adding a new 'system table' to hold store files.  I'd
> > suggest
> > >> the design and implementation need a revisit before we go forward; for
> > >> instance, factoring for systems other than s3 as suggested above (I
> like
> > >> the Duo list).
> > >>
> > >> S
> > >>
> > >>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang)  >
> > >>> wrote:
> > >>>
> > >>> What about just storing the hfile list in a file? Since now S3 has
> > strong
> > >>> consistency, we could safely overwrite a file then I think?
> > >>>
> > >>> And since the hfile list file will be very small, renaming will not
> be
> > a
> > >>> big problem.
> > >>>
> > >>> We could write the hfile list to a file called 'hfile.list.tmp', and
> > then
> > >>> rename it to 'hfile.list'.
> > >>>
> > >>> This is safe for HDFS, and for S3, since it is not atomic, maybe we
> > could
> > >>> face that, the 'hfile.list' file is not there, but there is a
> > >>> 'hfile.list.tmp'.
> > >>>
> > >>> So when opening a HStore, we first check if 'hfile.list' is there, if
> > >> not,
> > >>> try 'hfile.list.tmp', rename it and load it. For safety, we could
> write
> > >> an
> > >>> initial hfile list file with no hfiles. So if we can not load either
> > >>> 'hfile.list' or 'hfile.list.tmp', then we know something is wrong so
> > >> users
> > >>> should try to fix  it with HBCK.
> > >>> And in HBCK, we will do a listing and generate the 'hfile.list' file.
> > >>>
> > >>> WDYT?
> > >>>
> > >>> Thanks.
> > >>>
> > >>> Wellington Chevreuil wrote on Wed, May 19, 2021 at 10:43 PM:
> > >>>
> >  Thank you, Andrew and Duo,
> > 
> >  Talking internally with Josh Elser, initial idea was to rebase the
> > >>> feature
> >  branch with master (in order to catch with latest commits), then
> focus
> > >> on
> >  work to have a minimal functioning hbase, in other words, together
> > with
> > >>> the
> >  already committed work from HBASE-25391, make sure flush,
> compactions,
> >  splits and merges all can take advantage of the persistent store
> file
> >  manager and complete 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-24 Thread Andrew Purtell
> I do not think it should be a table level config. It should be a cluster
level config. We only have one FileSystem so it is useless to let different
tables have different ways to store hfile list.

The perspective that calls this "useless" is a limited one.

In our clusters, we value features that support incrementalism.


On Sun, May 23, 2021 at 5:43 AM 张铎(Duo Zhang)  wrote:

> I do not think it should be a table level config. It should be a cluster
> level config. We only have one FileSystem so it is useless to let different
> tables have different ways to store hfile list.
>
> But I think the general approach is fine. We could introduce a config for
> whether to enable 'write to data directory directly' mode. When rolling
> upgrading, the flag should be false, you can change it to true after the
> whole cluster has been upgraded.
>
> And for downgrading, usually we do not support downgrading from a major
> version upgrading, so it is not a big problem.
>
> Thanks.
>
> Andrew Purtell wrote on Sun, May 23, 2021 at 12:53 AM:
>
> > Put a check in the code whether hfilelist mode or original store layout
> is
> > in use and handles both cases. Then, to upgrade:
> >
> > 1. First, perform a rolling upgrade to $NEW_VERSION .
> >
> > 2. Once upgraded to $NEW_VERSION execute an alter table command that
> > enables hfilelist mode. This will cause all regions to close and reopen
> in
> > the new mode.
> >
> > Because the rolling upgrade to $NEW_VERSION is completed first a mix of
> > old and new layouts is fine, for the brief period of time when store
> > layouts are upgrading in response to the alter command, because this
> > version can handle both.
> >
> > Downgrade to an older version is not possible after the alter table
> > command, so this must be clearly documented, but of course would not be a
> > surprise to anyone, because the alter command is for switching to the new
> > store layout.
> >
> >
> > > On May 22, 2021, at 8:03 AM, 张铎  wrote:
> > >
> > > I could put up a simple design doc for this.
> > >
> > > But there is still a problem, about how to do rolling upgrading.
> > >
> > > After we changed the behavior, the region server will write partial
> store
> > > files directly into the data directory. For new region servers, this is
> > not
> > > a problem, as we will read the hfilelist file to find out the valid
> store
> > > files.
> > > But when rolling upgrading, we can not upgrade all the regionservers at
> > > once, for old regionservers, they will initialize a store by listing
> the
> > > store files, so if a new regionserver crashes when compacting and its
> > > regions are assigned to old regionservers, the old regionservers will
> be
> > in
> > > trouble...
> > >
> > > Stack wrote on Sat, May 22, 2021 at 12:14 PM:
> > >
> > >> HBASE-24749 design and implementation had acknowledged compromises on
> > >> review: e.g. adding a new 'system table' to hold store files.  I'd
> > suggest
> > >> the design and implementation need a revisit before we go forward; for
> > >> instance, factoring for systems other than s3 as suggested above (I
> like
> > >> the Duo list).
> > >>
> > >> S
> > >>
> > >>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang)  >
> > >>> wrote:
> > >>>
> > >>> What about just storing the hfile list in a file? Since now S3 has
> > strong
> > >>> consistency, we could safely overwrite a file then I think?
> > >>>
> > >>> And since the hfile list file will be very small, renaming will not
> be
> > a
> > >>> big problem.
> > >>>
> > >>> We could write the hfile list to a file called 'hfile.list.tmp', and
> > then
> > >>> rename it to 'hfile.list'.
> > >>>
> > >>> This is safe for HDFS, and for S3, since it is not atomic, maybe we
> > could
> > >>> face that, the 'hfile.list' file is not there, but there is a
> > >>> 'hfile.list.tmp'.
> > >>>
> > >>> So when opening a HStore, we first check if 'hfile.list' is there, if
> > >> not,
> > >>> try 'hfile.list.tmp', rename it and load it. For safety, we could
> write
> > >> an
> > >>> initial hfile list file with no hfiles. So if we can not load either
> > >>> 'hfile.list' or 'hfile.list.tmp', then we know something is wrong so
> > >> users
> > >>> should try to fix  it with HBCK.
> > >>> And in HBCK, we will do a listing and generate the 'hfile.list' file.
> > >>>
> > >>> WDYT?
> > >>>
> > >>> Thanks.
> > >>>
> > >>> Wellington Chevreuil wrote on Wed, May 19, 2021 at 10:43 PM:
> > >>>
> >  Thank you, Andrew and Duo,
> > 
> >  Talking internally with Josh Elser, initial idea was to rebase the
> > >>> feature
> >  branch with master (in order to catch with latest commits), then
> focus
> > >> on
> >  work to have a minimal functioning hbase, in other words, together
> > with
> > >>> the
> >  already committed work from HBASE-25391, make sure flush,
> compactions,
> >  splits and merges all can take advantage of the persistent store
> file
> >  manager and complete with no need to rely on renames. These all map
> to
> > >>> the
> >  

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-24 Thread Andrew Purtell
> And for downgrading, usually we do not support downgrading from a major
version upgrading, so it is not a big problem.

You missed an earlier comment from me.

Our team requires this to be released in a branch-2 version or we can't use
it. Therefore I am not in favor of any solution that requires a major
version increment.


On Sun, May 23, 2021 at 5:43 AM 张铎(Duo Zhang)  wrote:

> I do not think it should be a table level config. It should be a cluster
> level config. We only have one FileSystem so it is useless to let different
> tables have different ways to store hfile list.
>
> But I think the general approach is fine. We could introduce a config for
> whether to enable 'write to data directory directly' mode. When rolling
> upgrading, the flag should be false, you can change it to true after the
> whole cluster has been upgraded.
>
> And for downgrading, usually we do not support downgrading from a major
> version upgrading, so it is not a big problem.
>
> Thanks.
>
> Andrew Purtell wrote on Sun, May 23, 2021 at 12:53 AM:
>
> > Put a check in the code whether hfilelist mode or original store layout
> is
> > in use and handles both cases. Then, to upgrade:
> >
> > 1. First, perform a rolling upgrade to $NEW_VERSION .
> >
> > 2. Once upgraded to $NEW_VERSION execute an alter table command that
> > enables hfilelist mode. This will cause all regions to close and reopen
> in
> > the new mode.
> >
> > Because the rolling upgrade to $NEW_VERSION is completed first a mix of
> > old and new layouts is fine, for the brief period of time when store
> > layouts are upgrading in response to the alter command, because this
> > version can handle both.
> >
> > Downgrade to an older version is not possible after the alter table
> > command, so this must be clearly documented, but of course would not be a
> > surprise to anyone, because the alter command is for switching to the new
> > store layout.
> >
> >
> > > On May 22, 2021, at 8:03 AM, 张铎  wrote:
> > >
> > > I could put up a simple design doc for this.
> > >
> > > But there is still a problem, about how to do rolling upgrading.
> > >
> > > After we changed the behavior, the region server will write partial
> store
> > > files directly into the data directory. For new region servers, this is
> > not
> > > a problem, as we will read the hfilelist file to find out the valid
> store
> > > files.
> > > But when rolling upgrading, we can not upgrade all the regionservers at
> > > once, for old regionservers, they will initialize a store by listing
> the
> > > store files, so if a new regionserver crashes when compacting and its
> > > regions are assigned to old regionservers, the old regionservers will
> be
> > in
> > > trouble...
> > >
> > > Stack wrote on Sat, May 22, 2021 at 12:14 PM:
> > >
> > >> HBASE-24749 design and implementation had acknowledged compromises on
> > >> review: e.g. adding a new 'system table' to hold store files.  I'd
> > suggest
> > >> the design and implementation need a revisit before we go forward; for
> > >> instance, factoring for systems other than s3 as suggested above (I
> like
> > >> the Duo list).
> > >>
> > >> S
> > >>
> > >>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang)  >
> > >>> wrote:
> > >>>
> > >>> What about just storing the hfile list in a file? Since now S3 has
> > strong
> > >>> consistency, we could safely overwrite a file then I think?
> > >>>
> > >>> And since the hfile list file will be very small, renaming will not
> be
> > a
> > >>> big problem.
> > >>>
> > >>> We could write the hfile list to a file called 'hfile.list.tmp', and
> > then
> > >>> rename it to 'hfile.list'.
> > >>>
> > >>> This is safe for HDFS, and for S3, since it is not atomic, maybe we
> > could
> > >>> face that, the 'hfile.list' file is not there, but there is a
> > >>> 'hfile.list.tmp'.
> > >>>
> > >>> So when opening a HStore, we first check if 'hfile.list' is there, if
> > >> not,
> > >>> try 'hfile.list.tmp', rename it and load it. For safety, we could
> write
> > >> an
> > >>> initial hfile list file with no hfiles. So if we can not load either
> > >>> 'hfile.list' or 'hfile.list.tmp', then we know something is wrong so
> > >> users
> > >>> should try to fix  it with HBCK.
> > >>> And in HBCK, we will do a listing and generate the 'hfile.list' file.
> > >>>
> > >>> WDYT?
> > >>>
> > >>> Thanks.
> > >>>
> > >>> Wellington Chevreuil wrote on Wed, May 19, 2021 at 10:43 PM:
> > >>>
> >  Thank you, Andrew and Duo,
> > 
> >  Talking internally with Josh Elser, initial idea was to rebase the
> > >>> feature
> >  branch with master (in order to catch with latest commits), then
> focus
> > >> on
> >  work to have a minimal functioning hbase, in other words, together
> > with
> > >>> the
> >  already committed work from HBASE-25391, make sure flush,
> compactions,
> >  splits and merges all can take advantage of the persistent store
> file
> >  manager and complete with no need to rely on renames. These all map
> to
> > >>> the
> >  substasks 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-23 Thread Duo Zhang
I do not think it should be a table level config. It should be a cluster
level config. We only have one FileSystem so it is useless to let different
tables have different ways to store hfile list.

But I think the general approach is fine. We could introduce a config for
whether to enable 'write to data directory directly' mode. When rolling
upgrading, the flag should be false, you can change it to true after the
whole cluster has been upgraded.

And for downgrading, usually we do not support downgrading from a major
version upgrading, so it is not a big problem.

Thanks.

Andrew Purtell wrote on Sun, May 23, 2021 at 12:53 AM:

> Put a check in the code whether hfilelist mode or original store layout is
> in use and handles both cases. Then, to upgrade:
>
> 1. First, perform a rolling upgrade to $NEW_VERSION .
>
> 2. Once upgraded to $NEW_VERSION execute an alter table command that
> enables hfilelist mode. This will cause all regions to close and reopen in
> the new mode.
>
> Because the rolling upgrade to $NEW_VERSION is completed first a mix of
> old and new layouts is fine, for the brief period of time when store
> layouts are upgrading in response to the alter command, because this
> version can handle both.
>
> Downgrade to an older version is not possible after the alter table
> command, so this must be clearly documented, but of course would not be a
> surprise to anyone, because the alter command is for switching to the new
> store layout.
>
>
> > On May 22, 2021, at 8:03 AM, 张铎  wrote:
> >
> > I could put up a simple design doc for this.
> >
> > But there is still a problem, about how to do rolling upgrading.
> >
> > After we changed the behavior, the region server will write partial store
> > files directly into the data directory. For new region servers, this is
> not
> > a problem, as we will read the hfilelist file to find out the valid store
> > files.
> > But when rolling upgrading, we can not upgrade all the regionservers at
> > once, for old regionservers, they will initialize a store by listing the
> > store files, so if a new regionserver crashes when compacting and its
> > regions are assigned to old regionservers, the old regionservers will be
> in
> > trouble...
> >
> > Stack wrote on Sat, May 22, 2021 at 12:14 PM:
> >
> >> HBASE-24749 design and implementation had acknowledged compromises on
> >> review: e.g. adding a new 'system table' to hold store files.  I'd
> suggest
> >> the design and implementation need a revisit before we go forward; for
> >> instance, factoring for systems other than s3 as suggested above (I like
> >> the Duo list).
> >>
> >> S
> >>
> >>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
> >>> wrote:
> >>>
> >>> What about just storing the hfile list in a file? Since now S3 has
> strong
> >>> consistency, we could safely overwrite a file then I think?
> >>>
> >>> And since the hfile list file will be very small, renaming will not be
> a
> >>> big problem.
> >>>
> >>> We could write the hfile list to a file called 'hfile.list.tmp', and
> then
> >>> rename it to 'hfile.list'.
> >>>
> >>> This is safe for HDFS, and for S3, since it is not atomic, maybe we
> could
> >>> face that, the 'hfile.list' file is not there, but there is a
> >>> 'hfile.list.tmp'.
> >>>
> >>> So when opening a HStore, we first check if 'hfile.list' is there, if
> >> not,
> >>> try 'hfile.list.tmp', rename it and load it. For safety, we could write
> >> an
> >>> initial hfile list file with no hfiles. So if we can not load either
> >>> 'hfile.list' or 'hfile.list.tmp', then we know something is wrong so
> >> users
> >>> should try to fix  it with HBCK.
> >>> And in HBCK, we will do a listing and generate the 'hfile.list' file.
> >>>
> >>> WDYT?
> >>>
> >>> Thanks.
> >>>
> >>> Wellington Chevreuil wrote on Wed, May 19, 2021 at 10:43 PM:
> >>>
>  Thank you, Andrew and Duo,
> 
>  Talking internally with Josh Elser, initial idea was to rebase the
> >>> feature
>  branch with master (in order to catch with latest commits), then focus
> >> on
>  work to have a minimal functioning hbase, in other words, together
> with
> >>> the
>  already committed work from HBASE-25391, make sure flush, compactions,
>  splits and merges all can take advantage of the persistent store file
>  manager and complete with no need to rely on renames. These all map to
> >>> the
>  substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test
> >>> and
>  validate this works well for our goals, we can then focus on
> snapshots,
>  bulkloading and tooling.
> 
>  S3 now supports strong consistency, and I heard that they are also
> > implementing atomic renaming currently, so maybe that's one of the
>  reasons
> > why the development is silent now..
> >
>  Interesting, I had no idea this was being implemented. I know,
> >> however, a
>  version of this feature is already available on latest EMR releases
> (at
>  least from 6.2.0), and AWS team has published their own 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-22 Thread Andrew Purtell
Put a check in the code for whether hfilelist mode or the original store layout is 
in use and handle both cases. Then, to upgrade:

1. First, perform a rolling upgrade to $NEW_VERSION . 

2. Once upgraded to $NEW_VERSION execute an alter table command that enables 
hfilelist mode. This will cause all regions to close and reopen in the new 
mode. 

Because the rolling upgrade to $NEW_VERSION is completed first, a mix of old and 
new layouts is fine for the brief period of time when store layouts are 
upgrading in response to the alter command, because this version can handle 
both. 

Downgrade to an older version is not possible after the alter table command, so 
this must be clearly documented, but of course would not be a surprise to 
anyone, because the alter command is for switching to the new store layout. 
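
Step 2 could look roughly like the following via the Java Admin API (the table
attribute key is hypothetical; the actual switch would be defined in the design
doc), with modifyTable triggering the region close/reopen:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class EnableHFileListMode {
  public static void enable(TableName table) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin()) {
      TableDescriptor current = admin.getDescriptor(table);
      TableDescriptor updated = TableDescriptorBuilder.newBuilder(current)
          .setValue("hbase.store.hfilelist.enabled", "true") // hypothetical key
          .build();
      // Regions close and reopen with the new store layout as the alter runs.
      admin.modifyTable(updated);
    }
  }
}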


> On May 22, 2021, at 8:03 AM, 张铎  wrote:
> 
> I could put up a simple design doc for this.
> 
> But there is still a problem, about how to do rolling upgrading.
> 
> After we changed the behavior, the region server will write partial store
> files directly into the data directory. For new region servers, this is not
> a problem, as we will read the hfilelist file to find out the valid store
> files.
> But when rolling upgrading, we can not upgrade all the regionservers at
> once, for old regionservers, they will initialize a store by listing the
> store files, so if a new regionserver crashes when compacting and its
> regions are assigned to old regionservers, the old regionservers will be in
> trouble...
> 
> Stack wrote on Sat, May 22, 2021 at 12:14 PM:
> 
>> HBASE-24749 design and implementation had acknowledged compromises on
>> review: e.g. adding a new 'system table' to hold store files.  I'd suggest
>> the design and implementation need a revisit before we go forward; for
>> instance, factoring for systems other than s3 as suggested above (I like
>> the Duo list).
>> 
>> S
>> 
>>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
>>> wrote:
>>> 
>>> What about just storing the hfile list in a file? Since now S3 has strong
>>> consistency, we could safely overwrite a file then I think?
>>> 
>>> And since the hfile list file will be very small, renaming will not be a
>>> big problem.
>>> 
>>> We could write the hfile list to a file called 'hfile.list.tmp', and then
>>> rename it to 'hfile.list'.
>>> 
>>> This is safe for HDFS, and for S3, since it is not atomic, maybe we could
>>> face that, the 'hfile.list' file is not there, but there is a
>>> 'hfile.list.tmp'.
>>> 
>>> So when opening a HStore, we first check if 'hfile.list' is there, if
>> not,
>>> try 'hfile.list.tmp', rename it and load it. For safety, we could write
>> an
>>> initial hfile list file with no hfiles. So if we can not load either
>>> 'hfile.list' or 'hfile.list.tmp', then we know something is wrong so
>> users
>>> should try to fix  it with HBCK.
>>> And in HBCK, we will do a listing and generate the 'hfile.list' file.
>>> 
>>> WDYT?
>>> 
>>> Thanks.
>>> 
>>> Wellington Chevreuil wrote on Wed, May 19, 2021 at 10:43 PM:
>>> 
 Thank you, Andrew and Duo,
 
 Talking internally with Josh Elser, initial idea was to rebase the
>>> feature
 branch with master (in order to catch with latest commits), then focus
>> on
 work to have a minimal functioning hbase, in other words, together with
>>> the
 already committed work from HBASE-25391, make sure flush, compactions,
 splits and merges all can take advantage of the persistent store file
 manager and complete with no need to rely on renames. These all map to
>>> the
 substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test
>>> and
 validate this works well for our goals, we can then focus on snapshots,
 bulkloading and tooling.
 
 S3 now supports strong consistency, and I heard that they are also
> implementing atomic renaming currently, so maybe that's one of the
 reasons
> why the development is silent now..
> 
 Interesting, I had no idea this was being implemented. I know,
>> however, a
 version of this feature is already available on latest EMR releases (at
 least from 6.2.0), and AWS team has published their own blog post with
 their results:
 
 
>>> 
>> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
 
 But I do not think store hfile list in meta is the only solution. It
>> will
> cause cyclic dependencies for hbase:meta, and then force us a have a
> fallback solution which makes the code a bit ugly. We should try to
>> see
 if
> this could be done with only the FileSystem.
> 
 This is indeed a relevant concern. One idea I had mentioned in the
>>> original
 design doc was to track committed/non-committed files through xattr (or
 tags), which may have its own performance issues as explained by
>> Stephen
 Wu, but is something that could be attempted.
 
 Em qua., 19 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-22 Thread Duo Zhang
I could put up a simple design doc for this.

But there is still a problem, about how to do rolling upgrading.

After we changed the behavior, the region server will write partial store
files directly into the data directory. For new region servers, this is not
a problem, as we will read the hfilelist file to find out the valid store
files.
But when rolling upgrading, we can not upgrade all the regionservers at
once, for old regionservers, they will initialize a store by listing the
store files, so if a new regionserver crashes when compacting and its
regions are assigned to old regionservers, the old regionservers will be in
trouble...

Stack wrote on Sat, May 22, 2021 at 12:14 PM:

> HBASE-24749 design and implementation had acknowledged compromises on
> review: e.g. adding a new 'system table' to hold store files.  I'd suggest
> the design and implementation need a revisit before we go forward; for
> instance, factoring for systems other than s3 as suggested above (I like
> the Duo list).
>
> S
>
> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
> wrote:
>
> > What about just storing the hfile list in a file? Since now S3 has strong
> > consistency, we could safely overwrite a file then I think?
> >
> > And since the hfile list file will be very small, renaming will not be a
> > big problem.
> >
> > We could write the hfile list to a file called 'hfile.list.tmp', and then
> > rename it to 'hfile.list'.
> >
> > This is safe for HDFS, and for S3, since it is not atomic, maybe we could
> > face that, the 'hfile.list' file is not there, but there is a
> > 'hfile.list.tmp'.
> >
> > So when opening a HStore, we first check if 'hfile.list' is there, if
> not,
> > try 'hfile.list.tmp', rename it and load it. For safety, we could write
> an
> > initial hfile list file with no hfiles. So if we can not load either
> > 'hfile.list' or 'hfile.list.tmp', then we know something is wrong so
> users
> > should try to fix  it with HBCK.
> > And in HBCK, we will do a listing and generate the 'hfile.list' file.
> >
> > WDYT?
> >
> > Thanks.
> >
> > Wellington Chevreuil wrote on Wed, May 19, 2021 at 10:43 PM:
> >
> > > Thank you, Andrew and Duo,
> > >
> > > Talking internally with Josh Elser, initial idea was to rebase the
> > feature
> > > branch with master (in order to catch with latest commits), then focus
> on
> > > work to have a minimal functioning hbase, in other words, together with
> > the
> > > already committed work from HBASE-25391, make sure flush, compactions,
> > > splits and merges all can take advantage of the persistent store file
> > > manager and complete with no need to rely on renames. These all map to
> > the
> > > substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test
> > and
> > > validate this works well for our goals, we can then focus on snapshots,
> > > bulkloading and tooling.
> > >
> > > S3 now supports strong consistency, and I heard that they are also
> > > > implementing atomic renaming currently, so maybe that's one of the
> > > reasons
> > > > why the development is silent now..
> > > >
> > > Interesting, I had no idea this was being implemented. I know,
> however, a
> > > version of this feature is already available on latest EMR releases (at
> > > least from 6.2.0), and AWS team has published their own blog post with
> > > their results:
> > >
> > >
> >
> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
> > >
> > > But I do not think store hfile list in meta is the only solution. It
> will
> > > > cause cyclic dependencies for hbase:meta, and then force us a have a
> > > > fallback solution which makes the code a bit ugly. We should try to
> see
> > > if
> > > > this could be done with only the FileSystem.
> > > >
> > > This is indeed a relevant concern. One idea I had mentioned in the
> > original
> > > design doc was to track committed/non-committed files through xattr (or
> > > tags), which may have its own performance issues as explained by
> Stephen
> > > Wu, but is something that could be attempted.
> > >
> > > On Wed, May 19, 2021 at 04:56, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> > >
> > > > S3 now supports strong consistency, and I heard that they are also
> > > > implementing atomic renaming currently, so maybe that's one of the
> > > reasons
> > > > why the development is silent now...
> > > >
> > > > For me, I also think deploying hbase on cloud storage is the future,
> > so I
> > > > would also like to participate here.
> > > >
> > > > But I do not think store hfile list in meta is the only solution. It
> > will
> > > > cause cyclic dependencies for hbase:meta, and then force us a have a
> > > > fallback solution which makes the code a bit ugly. We should try to
> see
> > > if
> > > > this could be done with only the FileSystem.
> > > >
> > > > Thanks.
> > > >
> > > > Andrew Purtell wrote on Wed, May 19, 2021 at 8:04 AM:
> > > >
> > > > > Wellington (and et. al),
> > > > >
> > 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-21 Thread Stack
HBASE-24749 design and implementation had acknowledged compromises on
review: e.g. adding a new 'system table' to hold store files.  I'd suggest
the design and implementation need a revisit before we go forward; for
instance, factoring for systems other than s3 as suggested above (I like
the Duo list).

S

On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang)  wrote:

> What about just storing the hfile list in a file? Since now S3 has strong
> consistency, we could safely overwrite a file then I think?
>
> And since the hfile list file will be very small, renaming will not be a
> big problem.
>
> We could write the hfile list to a file called 'hfile.list.tmp', and then
> rename it to 'hfile.list'.
>
> This is safe for HDFS, and for S3, since it is not atomic, maybe we could
> face that, the 'hfile.list' file is not there, but there is a
> 'hfile.list.tmp'.
>
> So when opening a HStore, we first check if 'hfile.list' is there, if not,
> try 'hfile.list.tmp', rename it and load it. For safety, we could write an
> initial hfile list file with no hfiles. So if we can not load either
> 'hfile.list' or 'hfile.list.tmp', then we know something is wrong so users
> should try to fix  it with HBCK.
> And in HBCK, we will do a listing and generate the 'hfile.list' file.
>
> WDYT?
>
> Thanks.
>
> Wellington Chevreuil  于2021年5月19日周三
> 下午10:43写道:
>
> > Thank you, Andrew and Duo,
> >
> > Talking internally with Josh Elser, initial idea was to rebase the
> feature
> > branch with master (in order to catch with latest commits), then focus on
> > work to have a minimal functioning hbase, in other words, together with
> the
> > already committed work from HBASE-25391, make sure flush, compactions,
> > splits and merges all can take advantage of the persistent store file
> > manager and complete with no need to rely on renames. These all map to
> the
> > substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test
> and
> > validate this works well for our goals, we can then focus on snapshots,
> > bulkloading and tooling.
> >
> > S3 now supports strong consistency, and I heard that they are also
> > > implementing atomic renaming currently, so maybe that's one of the
> > reasons
> > > why the development is silent now..
> > >
> > Interesting, I had no idea this was being implemented. I know, however, a
> > version of this feature is already available on latest EMR releases (at
> > least from 6.2.0), and AWS team has published their own blog post with
> > their results:
> >
> >
> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
> >
> > But I do not think store hfile list in meta is the only solution. It will
> > > cause cyclic dependencies for hbase:meta, and then force us a have a
> > > fallback solution which makes the code a bit ugly. We should try to see
> > if
> > > this could be done with only the FileSystem.
> > >
> > This is indeed a relevant concern. One idea I had mentioned in the
> original
> > design doc was to track committed/non-committed files through xattr (or
> > tags), which may have its own performance issues as explained by Stephen
> > Wu, but is something that could be attempted.
> >
> > Em qua., 19 de mai. de 2021 às 04:56, 张铎(Duo Zhang) <
> palomino...@gmail.com
> > >
> > escreveu:
> >
> > > S3 now supports strong consistency, and I heard that they are also
> > > implementing atomic renaming currently, so maybe that's one of the
> > reasons
> > > why the development is silent now...
> > >
> > > For me, I also think deploying hbase on cloud storage is the future,
> so I
> > > would also like to participate here.
> > >
> > > But I do not think store hfile list in meta is the only solution. It
> will
> > > cause cyclic dependencies for hbase:meta, and then force us a have a
> > > fallback solution which makes the code a bit ugly. We should try to see
> > if
> > > this could be done with only the FileSystem.
> > >
> > > Thanks.
> > >
> > > Andrew Purtell  于2021年5月19日周三 上午8:04写道:
> > >
> > > > Wellington (and et. al),
> > > >
> > > > S3 is also an important piece of our future production plans.
> > > > Unfortunately,  we were unable to assist much with last year's work,
> on
> > > > account of being sidetracked by more immediate concerns. Fortunately,
> > > this
> > > > renewed interest is timely in that we have an HBase 2 project where,
> if
> > > > this can land in a 2.5 or a 2.6, it could be an important cost to
> serve
> > > > optimization, and one we could and would make use of. Therefore I
> would
> > > > like to restate my employer's interest in this work too. It may just
> be
> > > > Viraj and myself in the early days.
> > > >
> > > > I'm not sure how best to collaborate. We could review changes from
> the
> > > > original authors, new changes, and/or divide up the development
> tasks.
> > We
> > > > can certainly offer our time for testing, and can afford the costs of
> > > > testing against the S3 service.
> > 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-21 Thread Duo Zhang
So maybe we could introduce a .hfilelist directory, and put the hfile list
files under this directory, so we do not need to list all the files under
the region directory.

And considering the possible implementation for typical object storages,
listing the last directory on the whole path will be less expensive.
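
For illustration only, here is a minimal sketch of that scoping against the
plain Hadoop FileSystem API; the '.hfilelist' directory name and layout are
assumptions for the sake of the example, not anything already implemented:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HFileListDirSketch {
  // Lists only the small ".hfilelist" directory under a store directory, so
  // the object store LIST call stays cheap no matter how many hfiles exist.
  static void printHFileListFiles(Path storeDir) throws Exception {
    FileSystem fs = storeDir.getFileSystem(new Configuration());
    Path hfileListDir = new Path(storeDir, ".hfilelist"); // assumed name
    for (FileStatus status : fs.listStatus(hfileListDir)) {
      System.out.println("hfile list candidate: " + status.getPath());
    }
  }
}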

Andrew Purtell  于2021年5月22日周六 上午9:35写道:

>
> > On May 21, 2021, at 6:07 PM, 张铎  wrote:
> >
> > Since we just make use of the general FileSystem API to do listing, is
> it
> > possible to make use of ' bucket index listing'?
>
> Yes, those words mean the same thing.
>
> >
> > Andrew Purtell  于2021年5月22日周六 上午6:34写道:
> >
> >>
> >>
> >>> On May 20, 2021, at 4:00 AM, Wellington Chevreuil <
> >> wellington.chevre...@gmail.com> wrote:
> >>>
> >>> 
> 
> 
>  IMO it should be a file per store.
>  Per region is not suitable here as compaction is per store.
>  Per file means we still need to list all the files. And usually, after
>  compaction, we need to do an atomic operation to remove several old
> >> files
>  and add a new file, or even several files for stripe compaction. It
> >> will be
>  easy if we just write one file to commit these changes.
> 
> >>>
> >>> Fine for me if it's simpler. Mentioned the per file approach because I
> >>> thought it could be easier/faster to do that, rather than having to
> >> update
> >>> the store file list on every flush. AFAIK, append is out of the table,
> so
> >>> updating this file would mean read it, write original content plus new
> >>> hfile to a temp file, delete original file, rename it).
> >>>
> >>
> >> That sounds right to be.
> >>
> >> A minor potential optimization is the filename could have a timestamp
> >> component, so a bucket index listing at that path would pick up a list
> >> including the latest, and the latest would be used as the manifest of
> valid
> >> store files. The cloud object store is expected to provide an atomic
> >> listing semantic where the file is written and closed and only then is
> it
> >> visible, and it is visible at once to everyone. (I think this is
> available
> >> on most.) Old manifest file versions could be lazily deleted.
> >>
> >>
>  Em qui., 20 de mai. de 2021 às 02:57, 张铎(Duo Zhang) <
> >> palomino...@gmail.com>
>  escreveu:
> 
>  IIRC S3 is the only object storage which does not guarantee
>  read-after-write consistency in the past...
> 
>  This is the quick result after googling
> 
>  AWS [1]
> 
> > Amazon S3 delivers strong read-after-write consistency automatically
> >> for
> > all applications
> 
> 
>  Azure[2]
> 
> > Azure Storage was designed to embrace a strong consistency model that
> > guarantees that after the service performs an insert or update
> >> operation,
> > subsequent read operations return the latest update.
> 
> 
>  Aliyun[3]
> 
> > A feature requires that object operations in OSS be atomic, which
> > indicates that operations can only either succeed or fail without
> > intermediate states. To ensure that users can access only complete
> >> data,
> > OSS does not return corrupted or partial data.
> >
> > Object operations in OSS are highly consistent. For example, when a
> >> user
> > receives an upload (PUT) success response, the uploaded object can be
>  read
> > immediately, and copies of the object are written to multiple devices
> >> for
> > redundancy. Therefore, the situations where data is not obtained when
> >> you
> > perform the read-after-write operation do not exist. The same is true
> >> for
> > delete operations. After you delete an object, the object and its
> >> copies
>  no
> > longer exist.
> >
> 
>  GCP[4]
> 
> > Cloud Storage provides strong global consistency for the following
> > operations, including both data and metadata:
> >
> > Read-after-write
> > Read-after-metadata-update
> > Read-after-delete
> > Bucket listing
> > Object listing
> >
> 
>  I think these vendors could cover most end users in the world?
> 
>  1. https://aws.amazon.com/cn/s3/consistency/
>  2.
> 
> 
> >>
> https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet
>  3. https://www.alibabacloud.com/help/doc-detail/31827.htm
>  4. https://cloud.google.com/storage/docs/consistency
> 
>  Nick Dimiduk  于2021年5月19日周三 下午11:40写道:
> 
> > On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang)  >
> > wrote:
> >
> >> What about just storing the hfile list in a file? Since now S3 has
>  strong
> >> consistency, we could safely overwrite a file then I think?
> >>
> >
> > My concern is about portability. S3 isn't the only blob store in
> town,
>  and
> > consistent read-what-you-wrote semantics are not a standard feature,
> as
>  far
> > as I know. If we want something 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-21 Thread Andrew Purtell


> On May 21, 2021, at 6:07 PM, 张铎  wrote:
> 
> Since we just make use of the general FileSystem API to do listing, is it
> possible to make use of ' bucket index listing'?

Yes, those words mean the same thing. 

> 
> Andrew Purtell  于2021年5月22日周六 上午6:34写道:
> 
>> 
>> 
>>> On May 20, 2021, at 4:00 AM, Wellington Chevreuil <
>> wellington.chevre...@gmail.com> wrote:
>>> 
>>> 
 
 
 IMO it should be a file per store.
 Per region is not suitable here as compaction is per store.
 Per file means we still need to list all the files. And usually, after
 compaction, we need to do an atomic operation to remove several old
>> files
 and add a new file, or even several files for stripe compaction. It
>> will be
 easy if we just write one file to commit these changes.
 
>>> 
>>> Fine for me if it's simpler. Mentioned the per file approach because I
>>> thought it could be easier/faster to do that, rather than having to
>> update
>>> the store file list on every flush. AFAIK, append is out of the table, so
>>> updating this file would mean read it, write original content plus new
>>> hfile to a temp file, delete original file, rename it).
>>> 
>> 
>> That sounds right to be.
>> 
>> A minor potential optimization is the filename could have a timestamp
>> component, so a bucket index listing at that path would pick up a list
>> including the latest, and the latest would be used as the manifest of valid
>> store files. The cloud object store is expected to provide an atomic
>> listing semantic where the file is written and closed and only then is it
>> visible, and it is visible at once to everyone. (I think this is available
>> on most.) Old manifest file versions could be lazily deleted.
>> 
>> 
 Em qui., 20 de mai. de 2021 às 02:57, 张铎(Duo Zhang) <
>> palomino...@gmail.com>
 escreveu:
 
 IIRC S3 is the only object storage which does not guarantee
 read-after-write consistency in the past...
 
 This is the quick result after googling
 
 AWS [1]
 
> Amazon S3 delivers strong read-after-write consistency automatically
>> for
> all applications
 
 
 Azure[2]
 
> Azure Storage was designed to embrace a strong consistency model that
> guarantees that after the service performs an insert or update
>> operation,
> subsequent read operations return the latest update.
 
 
 Aliyun[3]
 
> A feature requires that object operations in OSS be atomic, which
> indicates that operations can only either succeed or fail without
> intermediate states. To ensure that users can access only complete
>> data,
> OSS does not return corrupted or partial data.
> 
> Object operations in OSS are highly consistent. For example, when a
>> user
> receives an upload (PUT) success response, the uploaded object can be
 read
> immediately, and copies of the object are written to multiple devices
>> for
> redundancy. Therefore, the situations where data is not obtained when
>> you
> perform the read-after-write operation do not exist. The same is true
>> for
> delete operations. After you delete an object, the object and its
>> copies
 no
> longer exist.
> 
 
 GCP[4]
 
> Cloud Storage provides strong global consistency for the following
> operations, including both data and metadata:
> 
> Read-after-write
> Read-after-metadata-update
> Read-after-delete
> Bucket listing
> Object listing
> 
 
 I think these vendors could cover most end users in the world?
 
 1. https://aws.amazon.com/cn/s3/consistency/
 2.
 
 
>> https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet
 3. https://www.alibabacloud.com/help/doc-detail/31827.htm
 4. https://cloud.google.com/storage/docs/consistency
 
 Nick Dimiduk  于2021年5月19日周三 下午11:40写道:
 
> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
> wrote:
> 
>> What about just storing the hfile list in a file? Since now S3 has
 strong
>> consistency, we could safely overwrite a file then I think?
>> 
> 
> My concern is about portability. S3 isn't the only blob store in town,
 and
> consistent read-what-you-wrote semantics are not a standard feature, as
 far
> as I know. If we want something that can work on 3 or 5 major public
 cloud
> blobstore products as well as a smattering of on-prem technologies, we
> should be selective about what features we choose to rely on as
> foundational to our implementation.
> 
> Or we are explicitly saying this will only work on S3 and we'll only
> support other services when they can achieve this level of
>> compatibility.
> 
> Either way, we should be clear and up-front about what semantics we
 demand.
> Implementing some kind of a test harness that can check compatibility
 would
> help 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-21 Thread Duo Zhang
Since we just make use of the general FileSystem API to do listing, is it
possible to make use of 'bucket index listing'?

Andrew Purtell  于2021年5月22日周六 上午6:34写道:

>
>
> > On May 20, 2021, at 4:00 AM, Wellington Chevreuil <
> wellington.chevre...@gmail.com> wrote:
> >
> > 
> >>
> >>
> >> IMO it should be a file per store.
> >> Per region is not suitable here as compaction is per store.
> >> Per file means we still need to list all the files. And usually, after
> >> compaction, we need to do an atomic operation to remove several old
> files
> >> and add a new file, or even several files for stripe compaction. It
> will be
> >> easy if we just write one file to commit these changes.
> >>
> >
> > Fine for me if it's simpler. Mentioned the per file approach because I
> > thought it could be easier/faster to do that, rather than having to
> update
> > the store file list on every flush. AFAIK, append is out of the table, so
> > updating this file would mean read it, write original content plus new
> > hfile to a temp file, delete original file, rename it).
> >
>
> That sounds right to be.
>
> A minor potential optimization is the filename could have a timestamp
> component, so a bucket index listing at that path would pick up a list
> including the latest, and the latest would be used as the manifest of valid
> store files. The cloud object store is expected to provide an atomic
> listing semantic where the file is written and closed and only then is it
> visible, and it is visible at once to everyone. (I think this is available
> on most.) Old manifest file versions could be lazily deleted.
>
>
> >> Em qui., 20 de mai. de 2021 às 02:57, 张铎(Duo Zhang) <
> palomino...@gmail.com>
> >> escreveu:
> >>
> >> IIRC S3 is the only object storage which does not guarantee
> >> read-after-write consistency in the past...
> >>
> >> This is the quick result after googling
> >>
> >> AWS [1]
> >>
> >>> Amazon S3 delivers strong read-after-write consistency automatically
> for
> >>> all applications
> >>
> >>
> >> Azure[2]
> >>
> >>> Azure Storage was designed to embrace a strong consistency model that
> >>> guarantees that after the service performs an insert or update
> operation,
> >>> subsequent read operations return the latest update.
> >>
> >>
> >> Aliyun[3]
> >>
> >>> A feature requires that object operations in OSS be atomic, which
> >>> indicates that operations can only either succeed or fail without
> >>> intermediate states. To ensure that users can access only complete
> data,
> >>> OSS does not return corrupted or partial data.
> >>>
> >>> Object operations in OSS are highly consistent. For example, when a
> user
> >>> receives an upload (PUT) success response, the uploaded object can be
> >> read
> >>> immediately, and copies of the object are written to multiple devices
> for
> >>> redundancy. Therefore, the situations where data is not obtained when
> you
> >>> perform the read-after-write operation do not exist. The same is true
> for
> >>> delete operations. After you delete an object, the object and its
> copies
> >> no
> >>> longer exist.
> >>>
> >>
> >> GCP[4]
> >>
> >>> Cloud Storage provides strong global consistency for the following
> >>> operations, including both data and metadata:
> >>>
> >>> Read-after-write
> >>> Read-after-metadata-update
> >>> Read-after-delete
> >>> Bucket listing
> >>> Object listing
> >>>
> >>
> >> I think these vendors could cover most end users in the world?
> >>
> >> 1. https://aws.amazon.com/cn/s3/consistency/
> >> 2.
> >>
> >>
> https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet
> >> 3. https://www.alibabacloud.com/help/doc-detail/31827.htm
> >> 4. https://cloud.google.com/storage/docs/consistency
> >>
> >> Nick Dimiduk  于2021年5月19日周三 下午11:40写道:
> >>
> >>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
> >>> wrote:
> >>>
>  What about just storing the hfile list in a file? Since now S3 has
> >> strong
>  consistency, we could safely overwrite a file then I think?
> 
> >>>
> >>> My concern is about portability. S3 isn't the only blob store in town,
> >> and
> >>> consistent read-what-you-wrote semantics are not a standard feature, as
> >> far
> >>> as I know. If we want something that can work on 3 or 5 major public
> >> cloud
> >>> blobstore products as well as a smattering of on-prem technologies, we
> >>> should be selective about what features we choose to rely on as
> >>> foundational to our implementation.
> >>>
> >>> Or we are explicitly saying this will only work on S3 and we'll only
> >>> support other services when they can achieve this level of
> compatibility.
> >>>
> >>> Either way, we should be clear and up-front about what semantics we
> >> demand.
> >>> Implementing some kind of a test harness that can check compatibility
> >> would
> >>> help here, a similar effort to that of defining standard behaviors of
> >> HDFS
> >>> implementations.
> >>>
> >>> I love this discussion :)
> >>>
> >>> And since the hfile 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-21 Thread Andrew Purtell



> On May 20, 2021, at 4:00 AM, Wellington Chevreuil 
>  wrote:
> 
> 
>> 
>> 
>> IMO it should be a file per store.
>> Per region is not suitable here as compaction is per store.
>> Per file means we still need to list all the files. And usually, after
>> compaction, we need to do an atomic operation to remove several old files
>> and add a new file, or even several files for stripe compaction. It will be
>> easy if we just write one file to commit these changes.
>> 
> 
> Fine for me if it's simpler. Mentioned the per file approach because I
> thought it could be easier/faster to do that, rather than having to update
> the store file list on every flush. AFAIK, append is out of the table, so
> updating this file would mean read it, write original content plus new
> hfile to a temp file, delete original file, rename it).
> 

That sounds right to me.

A minor potential optimization is that the filename could have a timestamp
component, so a bucket index listing at that path would pick up a list 
including the latest, and the latest would be used as the manifest of valid 
store files. The cloud object store is expected to provide an atomic listing 
semantic where the file is written and closed and only then is it visible, and 
it is visible at once to everyone. (I think this is available on most.) Old 
manifest file versions could be lazily deleted. 
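
As a rough sketch of that "pick the newest manifest" idea, assuming, purely for
illustration, manifest names of the form 'hfile.list.<epochMillis>':

import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LatestManifestSketch {
  private static final String PREFIX = "hfile.list.";

  // Returns the manifest with the largest timestamp suffix; older manifests
  // are simply ignored here and can be lazily deleted by a cleaner later.
  static Optional<Path> latestManifest(FileSystem fs, Path manifestDir) throws Exception {
    FileStatus[] listing = fs.listStatus(manifestDir, p -> p.getName().startsWith(PREFIX));
    return Arrays.stream(listing)
        .map(FileStatus::getPath)
        .max(Comparator.comparingLong(
            p -> Long.parseLong(p.getName().substring(PREFIX.length()))));
  }
}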


>> Em qui., 20 de mai. de 2021 às 02:57, 张铎(Duo Zhang) 
>> escreveu:
>> 
>> IIRC S3 is the only object storage which does not guarantee
>> read-after-write consistency in the past...
>> 
>> This is the quick result after googling
>> 
>> AWS [1]
>> 
>>> Amazon S3 delivers strong read-after-write consistency automatically for
>>> all applications
>> 
>> 
>> Azure[2]
>> 
>>> Azure Storage was designed to embrace a strong consistency model that
>>> guarantees that after the service performs an insert or update operation,
>>> subsequent read operations return the latest update.
>> 
>> 
>> Aliyun[3]
>> 
>>> A feature requires that object operations in OSS be atomic, which
>>> indicates that operations can only either succeed or fail without
>>> intermediate states. To ensure that users can access only complete data,
>>> OSS does not return corrupted or partial data.
>>> 
>>> Object operations in OSS are highly consistent. For example, when a user
>>> receives an upload (PUT) success response, the uploaded object can be
>> read
>>> immediately, and copies of the object are written to multiple devices for
>>> redundancy. Therefore, the situations where data is not obtained when you
>>> perform the read-after-write operation do not exist. The same is true for
>>> delete operations. After you delete an object, the object and its copies
>> no
>>> longer exist.
>>> 
>> 
>> GCP[4]
>> 
>>> Cloud Storage provides strong global consistency for the following
>>> operations, including both data and metadata:
>>> 
>>> Read-after-write
>>> Read-after-metadata-update
>>> Read-after-delete
>>> Bucket listing
>>> Object listing
>>> 
>> 
>> I think these vendors could cover most end users in the world?
>> 
>> 1. https://aws.amazon.com/cn/s3/consistency/
>> 2.
>> 
>> https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet
>> 3. https://www.alibabacloud.com/help/doc-detail/31827.htm
>> 4. https://cloud.google.com/storage/docs/consistency
>> 
>> Nick Dimiduk  于2021年5月19日周三 下午11:40写道:
>> 
>>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
>>> wrote:
>>> 
 What about just storing the hfile list in a file? Since now S3 has
>> strong
 consistency, we could safely overwrite a file then I think?
 
>>> 
>>> My concern is about portability. S3 isn't the only blob store in town,
>> and
>>> consistent read-what-you-wrote semantics are not a standard feature, as
>> far
>>> as I know. If we want something that can work on 3 or 5 major public
>> cloud
>>> blobstore products as well as a smattering of on-prem technologies, we
>>> should be selective about what features we choose to rely on as
>>> foundational to our implementation.
>>> 
>>> Or we are explicitly saying this will only work on S3 and we'll only
>>> support other services when they can achieve this level of compatibility.
>>> 
>>> Either way, we should be clear and up-front about what semantics we
>> demand.
>>> Implementing some kind of a test harness that can check compatibility
>> would
>>> help here, a similar effort to that of defining standard behaviors of
>> HDFS
>>> implementations.
>>> 
>>> I love this discussion :)
>>> 
>>> And since the hfile list file will be very small, renaming will not be a
 big problem.
 
>>> 
>>> Would this be a file per store? A file per region? Ah. Below you imply
>> it's
>>> per store.
>>> 
>>> Wellington Chevreuil  于2021年5月19日周三
 下午10:43写道:
 
> Thank you, Andrew and Duo,
> 
> Talking internally with Josh Elser, initial idea was to rebase the
 feature
> branch with master (in order to catch with latest commits), then
>> focus

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-20 Thread Wellington Chevreuil
>
> IMO it should be a file per store.
> Per region is not suitable here as compaction is per store.
> Per file means we still need to list all the files. And usually, after
> compaction, we need to do an atomic operation to remove several old files
> and add a new file, or even several files for stripe compaction. It will be
> easy if we just write one file to commit these changes.
>

Fine for me if it's simpler. I mentioned the per-file approach because I
thought it could be easier/faster to do that, rather than having to update
the store file list on every flush. AFAIK, append is off the table, so
updating this file would mean reading it, writing the original content plus the new
hfile to a temp file, deleting the original file, and renaming the temp file into place.
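
For concreteness, a minimal sketch of that flush-time update sequence over the
plain FileSystem API, assuming a simple one-hfile-name-per-line manifest format
(the file names and format here are illustrative only):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManifestUpdateSketch {
  // Adds one newly flushed hfile to the store's manifest by rewriting it.
  static void addHFile(FileSystem fs, Path manifest, String newHFileName) throws Exception {
    // 1. Read the current list, if the manifest already exists.
    List<String> names = new ArrayList<>();
    if (fs.exists(manifest)) {
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(manifest), StandardCharsets.UTF_8))) {
        String line;
        while ((line = in.readLine()) != null) {
          names.add(line);
        }
      }
    }
    names.add(newHFileName);

    // 2. Write the original content plus the new hfile to a temp file.
    Path tmp = new Path(manifest.getParent(), manifest.getName() + ".tmp");
    try (Writer out = new OutputStreamWriter(fs.create(tmp, true), StandardCharsets.UTF_8)) {
      for (String name : names) {
        out.write(name);
        out.write('\n');
      }
    }

    // 3. Delete the original and rename the temp file into place. The file is
    //    tiny, but note the rename itself is still not atomic on S3-like stores.
    fs.delete(manifest, false);
    fs.rename(tmp, manifest);
  }
}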

Em qui., 20 de mai. de 2021 às 02:57, 张铎(Duo Zhang) 
escreveu:

> IIRC S3 is the only object storage which does not guarantee
> read-after-write consistency in the past...
>
> This is the quick result after googling
>
> AWS [1]
>
> > Amazon S3 delivers strong read-after-write consistency automatically for
> > all applications
>
>
> Azure[2]
>
> > Azure Storage was designed to embrace a strong consistency model that
> > guarantees that after the service performs an insert or update operation,
> > subsequent read operations return the latest update.
>
>
> Aliyun[3]
>
> > A feature requires that object operations in OSS be atomic, which
> > indicates that operations can only either succeed or fail without
> > intermediate states. To ensure that users can access only complete data,
> > OSS does not return corrupted or partial data.
> >
> > Object operations in OSS are highly consistent. For example, when a user
> > receives an upload (PUT) success response, the uploaded object can be
> read
> > immediately, and copies of the object are written to multiple devices for
> > redundancy. Therefore, the situations where data is not obtained when you
> > perform the read-after-write operation do not exist. The same is true for
> > delete operations. After you delete an object, the object and its copies
> no
> > longer exist.
> >
>
> GCP[4]
>
> > Cloud Storage provides strong global consistency for the following
> > operations, including both data and metadata:
> >
> > Read-after-write
> > Read-after-metadata-update
> > Read-after-delete
> > Bucket listing
> > Object listing
> >
>
> I think these vendors could cover most end users in the world?
>
> 1. https://aws.amazon.com/cn/s3/consistency/
> 2.
>
> https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet
> 3. https://www.alibabacloud.com/help/doc-detail/31827.htm
> 4. https://cloud.google.com/storage/docs/consistency
>
> Nick Dimiduk  于2021年5月19日周三 下午11:40写道:
>
> > On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
> > wrote:
> >
> > > What about just storing the hfile list in a file? Since now S3 has
> strong
> > > consistency, we could safely overwrite a file then I think?
> > >
> >
> > My concern is about portability. S3 isn't the only blob store in town,
> and
> > consistent read-what-you-wrote semantics are not a standard feature, as
> far
> > as I know. If we want something that can work on 3 or 5 major public
> cloud
> > blobstore products as well as a smattering of on-prem technologies, we
> > should be selective about what features we choose to rely on as
> > foundational to our implementation.
> >
> > Or we are explicitly saying this will only work on S3 and we'll only
> > support other services when they can achieve this level of compatibility.
> >
> > Either way, we should be clear and up-front about what semantics we
> demand.
> > Implementing some kind of a test harness that can check compatibility
> would
> > help here, a similar effort to that of defining standard behaviors of
> HDFS
> > implementations.
> >
> > I love this discussion :)
> >
> > And since the hfile list file will be very small, renaming will not be a
> > > big problem.
> > >
> >
> > Would this be a file per store? A file per region? Ah. Below you imply
> it's
> > per store.
> >
> > Wellington Chevreuil  于2021年5月19日周三
> > > 下午10:43写道:
> > >
> > > > Thank you, Andrew and Duo,
> > > >
> > > > Talking internally with Josh Elser, initial idea was to rebase the
> > > feature
> > > > branch with master (in order to catch with latest commits), then
> focus
> > on
> > > > work to have a minimal functioning hbase, in other words, together
> with
> > > the
> > > > already committed work from HBASE-25391, make sure flush,
> compactions,
> > > > splits and merges all can take advantage of the persistent store file
> > > > manager and complete with no need to rely on renames. These all map
> to
> > > the
> > > > substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could
> test
> > > and
> > > > validate this works well for our goals, we can then focus on
> snapshots,
> > > > bulkloading and tooling.
> > > >
> > > > S3 now supports strong consistency, and I heard that they are also
> > > > > implementing atomic renaming currently, so maybe 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-19 Thread Duo Zhang
IIRC S3 was the only object storage which did not guarantee
read-after-write consistency in the past...

Here are the quick results after some googling:

AWS [1]

> Amazon S3 delivers strong read-after-write consistency automatically for
> all applications


Azure[2]

> Azure Storage was designed to embrace a strong consistency model that
> guarantees that after the service performs an insert or update operation,
> subsequent read operations return the latest update.


Aliyun[3]

> A feature requires that object operations in OSS be atomic, which
> indicates that operations can only either succeed or fail without
> intermediate states. To ensure that users can access only complete data,
> OSS does not return corrupted or partial data.
>
> Object operations in OSS are highly consistent. For example, when a user
> receives an upload (PUT) success response, the uploaded object can be read
> immediately, and copies of the object are written to multiple devices for
> redundancy. Therefore, the situations where data is not obtained when you
> perform the read-after-write operation do not exist. The same is true for
> delete operations. After you delete an object, the object and its copies no
> longer exist.
>

GCP[4]

> Cloud Storage provides strong global consistency for the following
> operations, including both data and metadata:
>
> Read-after-write
> Read-after-metadata-update
> Read-after-delete
> Bucket listing
> Object listing
>

I think these vendors could cover most end users in the world?

1. https://aws.amazon.com/cn/s3/consistency/
2.
https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet
3. https://www.alibabacloud.com/help/doc-detail/31827.htm
4. https://cloud.google.com/storage/docs/consistency

Nick Dimiduk  于2021年5月19日周三 下午11:40写道:

> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
> wrote:
>
> > What about just storing the hfile list in a file? Since now S3 has strong
> > consistency, we could safely overwrite a file then I think?
> >
>
> My concern is about portability. S3 isn't the only blob store in town, and
> consistent read-what-you-wrote semantics are not a standard feature, as far
> as I know. If we want something that can work on 3 or 5 major public cloud
> blobstore products as well as a smattering of on-prem technologies, we
> should be selective about what features we choose to rely on as
> foundational to our implementation.
>
> Or we are explicitly saying this will only work on S3 and we'll only
> support other services when they can achieve this level of compatibility.
>
> Either way, we should be clear and up-front about what semantics we demand.
> Implementing some kind of a test harness that can check compatibility would
> help here, a similar effort to that of defining standard behaviors of HDFS
> implementations.
>
> I love this discussion :)
>
> And since the hfile list file will be very small, renaming will not be a
> > big problem.
> >
>
> Would this be a file per store? A file per region? Ah. Below you imply it's
> per store.
>
> Wellington Chevreuil  于2021年5月19日周三
> > 下午10:43写道:
> >
> > > Thank you, Andrew and Duo,
> > >
> > > Talking internally with Josh Elser, initial idea was to rebase the
> > feature
> > > branch with master (in order to catch with latest commits), then focus
> on
> > > work to have a minimal functioning hbase, in other words, together with
> > the
> > > already committed work from HBASE-25391, make sure flush, compactions,
> > > splits and merges all can take advantage of the persistent store file
> > > manager and complete with no need to rely on renames. These all map to
> > the
> > > substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test
> > and
> > > validate this works well for our goals, we can then focus on snapshots,
> > > bulkloading and tooling.
> > >
> > > S3 now supports strong consistency, and I heard that they are also
> > > > implementing atomic renaming currently, so maybe that's one of the
> > > reasons
> > > > why the development is silent now..
> > > >
> > > Interesting, I had no idea this was being implemented. I know,
> however, a
> > > version of this feature is already available on latest EMR releases (at
> > > least from 6.2.0), and AWS team has published their own blog post with
> > > their results:
> > >
> > >
> >
> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
> > >
> > > But I do not think store hfile list in meta is the only solution. It
> will
> > > > cause cyclic dependencies for hbase:meta, and then force us a have a
> > > > fallback solution which makes the code a bit ugly. We should try to
> see
> > > if
> > > > this could be done with only the FileSystem.
> > > >
> > > This is indeed a relevant concern. One idea I had mentioned in the
> > original
> > > design doc was to track committed/non-committed files through xattr (or
> > > tags), which may have its own performance issues as explained 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-19 Thread Duo Zhang
Oh, just saw your last comment.

IMO it should be a file per store.
Per region is not suitable here as compaction is per store.
Per file means we still need to list all the files. And usually, after
compaction, we need to do an atomic operation to remove several old files
and add a new file, or even several files for stripe compaction. It will be
easy if we just write one file to commit these changes.

Thanks.

Nick Dimiduk  于2021年5月19日周三 下午11:40写道:

> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
> wrote:
>
> > What about just storing the hfile list in a file? Since now S3 has strong
> > consistency, we could safely overwrite a file then I think?
> >
>
> My concern is about portability. S3 isn't the only blob store in town, and
> consistent read-what-you-wrote semantics are not a standard feature, as far
> as I know. If we want something that can work on 3 or 5 major public cloud
> blobstore products as well as a smattering of on-prem technologies, we
> should be selective about what features we choose to rely on as
> foundational to our implementation.
>
> Or we are explicitly saying this will only work on S3 and we'll only
> support other services when they can achieve this level of compatibility.
>
> Either way, we should be clear and up-front about what semantics we demand.
> Implementing some kind of a test harness that can check compatibility would
> help here, a similar effort to that of defining standard behaviors of HDFS
> implementations.
>
> I love this discussion :)
>
> And since the hfile list file will be very small, renaming will not be a
> > big problem.
> >
>
> Would this be a file per store? A file per region? Ah. Below you imply it's
> per store.
>
> Wellington Chevreuil  于2021年5月19日周三
> > 下午10:43写道:
> >
> > > Thank you, Andrew and Duo,
> > >
> > > Talking internally with Josh Elser, initial idea was to rebase the
> > feature
> > > branch with master (in order to catch with latest commits), then focus
> on
> > > work to have a minimal functioning hbase, in other words, together with
> > the
> > > already committed work from HBASE-25391, make sure flush, compactions,
> > > splits and merges all can take advantage of the persistent store file
> > > manager and complete with no need to rely on renames. These all map to
> > the
> > > substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test
> > and
> > > validate this works well for our goals, we can then focus on snapshots,
> > > bulkloading and tooling.
> > >
> > > S3 now supports strong consistency, and I heard that they are also
> > > > implementing atomic renaming currently, so maybe that's one of the
> > > reasons
> > > > why the development is silent now..
> > > >
> > > Interesting, I had no idea this was being implemented. I know,
> however, a
> > > version of this feature is already available on latest EMR releases (at
> > > least from 6.2.0), and AWS team has published their own blog post with
> > > their results:
> > >
> > >
> >
> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
> > >
> > > But I do not think store hfile list in meta is the only solution. It
> will
> > > > cause cyclic dependencies for hbase:meta, and then force us a have a
> > > > fallback solution which makes the code a bit ugly. We should try to
> see
> > > if
> > > > this could be done with only the FileSystem.
> > > >
> > > This is indeed a relevant concern. One idea I had mentioned in the
> > original
> > > design doc was to track committed/non-committed files through xattr (or
> > > tags), which may have its own performance issues as explained by
> Stephen
> > > Wu, but is something that could be attempted.
> > >
> > > Em qua., 19 de mai. de 2021 às 04:56, 张铎(Duo Zhang) <
> > palomino...@gmail.com
> > > >
> > > escreveu:
> > >
> > > > S3 now supports strong consistency, and I heard that they are also
> > > > implementing atomic renaming currently, so maybe that's one of the
> > > reasons
> > > > why the development is silent now...
> > > >
> > > > For me, I also think deploying hbase on cloud storage is the future,
> > so I
> > > > would also like to participate here.
> > > >
> > > > But I do not think store hfile list in meta is the only solution. It
> > will
> > > > cause cyclic dependencies for hbase:meta, and then force us a have a
> > > > fallback solution which makes the code a bit ugly. We should try to
> see
> > > if
> > > > this could be done with only the FileSystem.
> > > >
> > > > Thanks.
> > > >
> > > > Andrew Purtell  于2021年5月19日周三 上午8:04写道:
> > > >
> > > > > Wellington (and et. al),
> > > > >
> > > > > S3 is also an important piece of our future production plans.
> > > > > Unfortunately,  we were unable to assist much with last year's
> work,
> > on
> > > > > account of being sidetracked by more immediate concerns.
> Fortunately,
> > > > this
> > > > > renewed interest is timely in that we have an HBase 2 project
> where,
> > if
> > > > > this 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-19 Thread Wellington Chevreuil
I like the idea of tracking via files in the store. We might even do a
single "hfile.commit" file for each "hfile" that got committed and has to
be loaded. When the store is opening, any hfile that doesn't have a
corresponding .commit file would simply not be loaded. That discards the
need for rename. Obviously this relies on the strong create-file consistency now
supported by S3; as Nick mentioned, we would need to define that as a
minimum for any object store we aim to support. And there's still the Store
Engine already proposed by HBASE-25395 that uses an extra table for
tracking; depending on how testing goes, we could offer that as a less
efficient implementation to be used with file systems that lack such
semantics.
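
A sketch of what the open-time filtering could look like under that scheme,
assuming a sibling '<hfile name>.commit' marker per committed hfile (the names
are illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitMarkerSketch {
  // Returns only the hfiles that have a matching ".commit" marker; hfiles left
  // behind by a failed flush or compaction have no marker and are skipped.
  static List<Path> committedHFiles(FileSystem fs, Path storeDir) throws Exception {
    List<Path> committed = new ArrayList<>();
    for (FileStatus status : fs.listStatus(storeDir)) {
      Path path = status.getPath();
      if (path.getName().endsWith(".commit")) {
        continue; // this is a marker, not an hfile
      }
      Path marker = new Path(storeDir, path.getName() + ".commit");
      if (fs.exists(marker)) {
        committed.add(path);
      }
    }
    return committed;
  }
}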

Em qua., 19 de mai. de 2021 às 17:28, Andrew Purtell <
andrew.purt...@gmail.com> escreveu:

> Consistent read what you wrote bucket metadata operations are standard now
> for S3, Google’s GCS, and anyone who uses Ceph via its radios-gw. I think
> it will be table stakes for cloud object storage. Although clients will all
> see the latest metadata state for an object updated in an atomic way, this
> is not a guarantee regarding views over blob contents. It may be fine but
> we will have to survey the real semantics of public cloud object stores. We
> can pick two or three public cloud providers - I would nominate Amazon and
> Alibaba’s public cloud products - as the design targets for the initial
> implementation. I like the idea of borrowing from what Hadoop did to define
> the FileSystem semantics contract and conformance test suite.
>
> I view the current state of things as a starting point not a settled
> implementation.
>
> Hfile tracking cannot be done in meta. Meta is not a scalable place to
> store state because it cannot be split. Even the minimal state we store
> there now becomes unwieldy as the number of regions and tables in a cluster
> grows large. In order to take this into production we require the results
> of this work to be ultimately committed to branch-2 and made available in
> new minor release from there. It can’t have a design dependency on
> something that either doesn’t exist or cannot be released except with a
> major version increment. We don’t have a path to a releasable branch-2
> implementation of a splittable meta table. I hope we can find agreement
> about this design constraint.
>
>
> > On May 19, 2021, at 8:40 AM, Nick Dimiduk  wrote:
> >
> > On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) 
> wrote:
> >
> >> What about just storing the hfile list in a file? Since now S3 has
> strong
> >> consistency, we could safely overwrite a file then I think?
> >>
> >
> > My concern is about portability. S3 isn't the only blob store in town,
> and
> > consistent read-what-you-wrote semantics are not a standard feature, as
> far
> > as I know. If we want something that can work on 3 or 5 major public
> cloud
> > blobstore products as well as a smattering of on-prem technologies, we
> > should be selective about what features we choose to rely on as
> > foundational to our implementation.
> >
> > Or we are explicitly saying this will only work on S3 and we'll only
> > support other services when they can achieve this level of compatibility.
> >
> > Either way, we should be clear and up-front about what semantics we
> demand.
> > Implementing some kind of a test harness that can check compatibility
> would
> > help here, a similar effort to that of defining standard behaviors of
> HDFS
> > implementations.
> >
> > I love this discussion :)
> >
> > And since the hfile list file will be very small, renaming will not be a
> >> big problem.
> >>
> >
> > Would this be a file per store? A file per region? Ah. Below you imply
> it's
> > per store.
> >
> > Wellington Chevreuil  于2021年5月19日周三
> >> 下午10:43写道:
> >>
> >>> Thank you, Andrew and Duo,
> >>>
> >>> Talking internally with Josh Elser, initial idea was to rebase the
> >> feature
> >>> branch with master (in order to catch with latest commits), then focus
> on
> >>> work to have a minimal functioning hbase, in other words, together with
> >> the
> >>> already committed work from HBASE-25391, make sure flush, compactions,
> >>> splits and merges all can take advantage of the persistent store file
> >>> manager and complete with no need to rely on renames. These all map to
> >> the
> >>> substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test
> >> and
> >>> validate this works well for our goals, we can then focus on snapshots,
> >>> bulkloading and tooling.
> >>>
> >>> S3 now supports strong consistency, and I heard that they are also
>  implementing atomic renaming currently, so maybe that's one of the
> >>> reasons
>  why the development is silent now..
> 
> >>> Interesting, I had no idea this was being implemented. I know,
> however, a
> >>> version of this feature is already available on latest EMR releases (at
> >>> least from 6.2.0), and AWS team has published their own blog post with
> >>> 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-19 Thread Andrew Purtell
Consistent read-what-you-wrote bucket metadata operations are standard now for
S3, Google’s GCS, and anyone who uses Ceph via its radosgw. I think it will
be table stakes for cloud object storage. Although clients will all see the 
latest metadata state for an object updated in an atomic way, this is not a 
guarantee regarding views over blob contents. It may be fine but we will have 
to survey the real semantics of public cloud object stores. We can pick two or 
three public cloud providers - I would nominate Amazon and Alibaba’s public 
cloud products - as the design targets for the initial implementation. I like 
the idea of borrowing from what Hadoop did to define the FileSystem semantics 
contract and conformance test suite. 

I view the current state of things as a starting point not a settled 
implementation.

Hfile tracking cannot be done in meta. Meta is not a scalable place to store 
state because it cannot be split. Even the minimal state we store there now 
becomes unwieldy as the number of regions and tables in a cluster grows large. 
In order to take this into production we require the results of this work to be 
ultimately committed to branch-2 and made available in a new minor release from
there. It can’t have a design dependency on something that either doesn’t exist 
or cannot be released except with a major version increment. We don’t have a 
path to a releasable branch-2 implementation of a splittable meta table. I hope 
we can find agreement about this design constraint. 


> On May 19, 2021, at 8:40 AM, Nick Dimiduk  wrote:
> 
> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang)  wrote:
> 
>> What about just storing the hfile list in a file? Since now S3 has strong
>> consistency, we could safely overwrite a file then I think?
>> 
> 
> My concern is about portability. S3 isn't the only blob store in town, and
> consistent read-what-you-wrote semantics are not a standard feature, as far
> as I know. If we want something that can work on 3 or 5 major public cloud
> blobstore products as well as a smattering of on-prem technologies, we
> should be selective about what features we choose to rely on as
> foundational to our implementation.
> 
> Or we are explicitly saying this will only work on S3 and we'll only
> support other services when they can achieve this level of compatibility.
> 
> Either way, we should be clear and up-front about what semantics we demand.
> Implementing some kind of a test harness that can check compatibility would
> help here, a similar effort to that of defining standard behaviors of HDFS
> implementations.
> 
> I love this discussion :)
> 
> And since the hfile list file will be very small, renaming will not be a
>> big problem.
>> 
> 
> Would this be a file per store? A file per region? Ah. Below you imply it's
> per store.
> 
> Wellington Chevreuil  于2021年5月19日周三
>> 下午10:43写道:
>> 
>>> Thank you, Andrew and Duo,
>>> 
>>> Talking internally with Josh Elser, initial idea was to rebase the
>> feature
>>> branch with master (in order to catch with latest commits), then focus on
>>> work to have a minimal functioning hbase, in other words, together with
>> the
>>> already committed work from HBASE-25391, make sure flush, compactions,
>>> splits and merges all can take advantage of the persistent store file
>>> manager and complete with no need to rely on renames. These all map to
>> the
>>> substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test
>> and
>>> validate this works well for our goals, we can then focus on snapshots,
>>> bulkloading and tooling.
>>> 
>>> S3 now supports strong consistency, and I heard that they are also
 implementing atomic renaming currently, so maybe that's one of the
>>> reasons
 why the development is silent now..
 
>>> Interesting, I had no idea this was being implemented. I know, however, a
>>> version of this feature is already available on latest EMR releases (at
>>> least from 6.2.0), and AWS team has published their own blog post with
>>> their results:
>>> 
>>> 
>> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
>>> 
>>> But I do not think store hfile list in meta is the only solution. It will
 cause cyclic dependencies for hbase:meta, and then force us a have a
 fallback solution which makes the code a bit ugly. We should try to see
>>> if
 this could be done with only the FileSystem.
 
>>> This is indeed a relevant concern. One idea I had mentioned in the
>> original
>>> design doc was to track committed/non-committed files through xattr (or
>>> tags), which may have its own performance issues as explained by Stephen
>>> Wu, but is something that could be attempted.
>>> 
>>> Em qua., 19 de mai. de 2021 às 04:56, 张铎(Duo Zhang) <
>> palomino...@gmail.com
 
>>> escreveu:
>>> 
 S3 now supports strong consistency, and I heard that they are also
 implementing atomic renaming currently, so 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-19 Thread Nick Dimiduk
On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang)  wrote:

> What about just storing the hfile list in a file? Since now S3 has strong
> consistency, we could safely overwrite a file then I think?
>

My concern is about portability. S3 isn't the only blob store in town, and
consistent read-what-you-wrote semantics are not a standard feature, as far
as I know. If we want something that can work on 3 or 5 major public cloud
blobstore products as well as a smattering of on-prem technologies, we
should be selective about what features we choose to rely on as
foundational to our implementation.

Or we are explicitly saying this will only work on S3 and we'll only
support other services when they can achieve this level of compatibility.

Either way, we should be clear and up-front about what semantics we demand.
Implementing some kind of a test harness that can check compatibility would
help here, a similar effort to that of defining standard behaviors of HDFS
implementations.

I love this discussion :)
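
To make the harness idea concrete, the smallest useful probe might just create
an object, read it back, and list it, against whichever FileSystem
implementation is configured. A rough sketch only, not tied to the actual
Hadoop contract-test classes:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConsistencyProbeSketch {
  // Fails fast if the store does not show read-after-write and
  // list-after-write behavior for a freshly created object.
  static void probe(Path dir) throws Exception {
    FileSystem fs = dir.getFileSystem(new Configuration());
    Path probeFile = new Path(dir, "consistency-probe");
    byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);

    try (FSDataOutputStream out = fs.create(probeFile, true)) {
      out.write(payload);
    }

    // Read-after-write: the object must be readable immediately after close.
    byte[] readBack = new byte[payload.length];
    try (FSDataInputStream in = fs.open(probeFile)) {
      in.readFully(readBack);
    }
    if (!Arrays.equals(payload, readBack)) {
      throw new IllegalStateException("stale read after write");
    }

    // List-after-write: the new object must appear in a directory listing.
    boolean listed = Arrays.stream(fs.listStatus(dir))
        .anyMatch(s -> s.getPath().getName().equals(probeFile.getName()));
    if (!listed) {
      throw new IllegalStateException("listing does not include the new object");
    }
  }
}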

And since the hfile list file will be very small, renaming will not be a
> big problem.
>

Would this be a file per store? A file per region? Ah. Below you imply it's
per store.

Wellington Chevreuil  于2021年5月19日周三
> 下午10:43写道:
>
> > Thank you, Andrew and Duo,
> >
> > Talking internally with Josh Elser, initial idea was to rebase the
> feature
> > branch with master (in order to catch with latest commits), then focus on
> > work to have a minimal functioning hbase, in other words, together with
> the
> > already committed work from HBASE-25391, make sure flush, compactions,
> > splits and merges all can take advantage of the persistent store file
> > manager and complete with no need to rely on renames. These all map to
> the
> > substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test
> and
> > validate this works well for our goals, we can then focus on snapshots,
> > bulkloading and tooling.
> >
> > S3 now supports strong consistency, and I heard that they are also
> > > implementing atomic renaming currently, so maybe that's one of the
> > reasons
> > > why the development is silent now..
> > >
> > Interesting, I had no idea this was being implemented. I know, however, a
> > version of this feature is already available on latest EMR releases (at
> > least from 6.2.0), and AWS team has published their own blog post with
> > their results:
> >
> >
> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
> >
> > But I do not think store hfile list in meta is the only solution. It will
> > > cause cyclic dependencies for hbase:meta, and then force us a have a
> > > fallback solution which makes the code a bit ugly. We should try to see
> > if
> > > this could be done with only the FileSystem.
> > >
> > This is indeed a relevant concern. One idea I had mentioned in the
> original
> > design doc was to track committed/non-committed files through xattr (or
> > tags), which may have its own performance issues as explained by Stephen
> > Wu, but is something that could be attempted.
> >
> > Em qua., 19 de mai. de 2021 às 04:56, 张铎(Duo Zhang) <
> palomino...@gmail.com
> > >
> > escreveu:
> >
> > > S3 now supports strong consistency, and I heard that they are also
> > > implementing atomic renaming currently, so maybe that's one of the
> > reasons
> > > why the development is silent now...
> > >
> > > For me, I also think deploying hbase on cloud storage is the future,
> so I
> > > would also like to participate here.
> > >
> > > But I do not think store hfile list in meta is the only solution. It
> will
> > > cause cyclic dependencies for hbase:meta, and then force us a have a
> > > fallback solution which makes the code a bit ugly. We should try to see
> > if
> > > this could be done with only the FileSystem.
> > >
> > > Thanks.
> > >
> > > Andrew Purtell  于2021年5月19日周三 上午8:04写道:
> > >
> > > > Wellington (and et. al),
> > > >
> > > > S3 is also an important piece of our future production plans.
> > > > Unfortunately,  we were unable to assist much with last year's work,
> on
> > > > account of being sidetracked by more immediate concerns. Fortunately,
> > > this
> > > > renewed interest is timely in that we have an HBase 2 project where,
> if
> > > > this can land in a 2.5 or a 2.6, it could be an important cost to
> serve
> > > > optimization, and one we could and would make use of. Therefore I
> would
> > > > like to restate my employer's interest in this work too. It may just
> be
> > > > Viraj and myself in the early days.
> > > >
> > > > I'm not sure how best to collaborate. We could review changes from
> the
> > > > original authors, new changes, and/or divide up the development
> tasks.
> > We
> > > > can certainly offer our time for testing, and can afford the costs of
> > > > testing against the S3 service.
> > > >
> > > >
> > > > On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <
> > > > 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-19 Thread Duo Zhang
What about just storing the hfile list in a file? Since now S3 has strong
consistency, we could safely overwrite a file then I think?

And since the hfile list file will be very small, renaming will not be a
big problem.

We could write the hfile list to a file called 'hfile.list.tmp', and then
rename it to 'hfile.list'.

This is safe for HDFS. For S3, since rename is not atomic, we could
face a situation where the 'hfile.list' file is not there, but there is a
'hfile.list.tmp'.

So when opening an HStore, we first check if 'hfile.list' is there; if not,
we try 'hfile.list.tmp', rename it and load it. For safety, we could write an
initial hfile list file with no hfiles. So if we cannot load either
'hfile.list' or 'hfile.list.tmp', then we know something is wrong and users
should try to fix it with HBCK.
And in HBCK, we will do a listing and generate the 'hfile.list' file.

WDYT?

Thanks.
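
A sketch of that open-time recovery path, using the 'hfile.list' /
'hfile.list.tmp' names above purely for illustration:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HFileListRecoverySketch {
  // Resolves the hfile list for a store at open time: prefer the committed
  // 'hfile.list', finish an interrupted rename if only 'hfile.list.tmp' is
  // present, otherwise signal that HBCK must relist and regenerate it.
  static Path resolveHFileList(FileSystem fs, Path storeDir) throws Exception {
    Path list = new Path(storeDir, "hfile.list");
    Path tmp = new Path(storeDir, "hfile.list.tmp");
    if (fs.exists(list)) {
      return list;
    }
    if (fs.exists(tmp)) {
      // We crashed between writing the tmp file and renaming it; finish now.
      fs.rename(tmp, list);
      return list;
    }
    // Since even an empty store writes an initial (empty) hfile list, reaching
    // here means something is wrong and HBCK should rebuild the file.
    throw new IllegalStateException("missing hfile list under " + storeDir + ", run HBCK");
  }
}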

Wellington Chevreuil  于2021年5月19日周三
下午10:43写道:

> Thank you, Andrew and Duo,
>
> Talking internally with Josh Elser, initial idea was to rebase the feature
> branch with master (in order to catch with latest commits), then focus on
> work to have a minimal functioning hbase, in other words, together with the
> already committed work from HBASE-25391, make sure flush, compactions,
> splits and merges all can take advantage of the persistent store file
> manager and complete with no need to rely on renames. These all map to the
> substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test and
> validate this works well for our goals, we can then focus on snapshots,
> bulkloading and tooling.
>
> S3 now supports strong consistency, and I heard that they are also
> > implementing atomic renaming currently, so maybe that's one of the
> reasons
> > why the development is silent now..
> >
> Interesting, I had no idea this was being implemented. I know, however, a
> version of this feature is already available on latest EMR releases (at
> least from 6.2.0), and AWS team has published their own blog post with
> their results:
>
> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
>
> > But I do not think storing the hfile list in meta is the only solution.
> > It will cause cyclic dependencies for hbase:meta, and then force us to
> > have a fallback solution, which makes the code a bit ugly. We should try
> > to see if this could be done with only the FileSystem.
> >
> This is indeed a relevant concern. One idea I had mentioned in the original
> design doc was to track committed/uncommitted files through xattrs (or
> tags), which may have its own performance issues, as explained by Stephen
> Wu, but is something that could be attempted.
>
> Em qua., 19 de mai. de 2021 às 04:56, 张铎(Duo Zhang)  >
> escreveu:
>
> > S3 now supports strong consistency, and I heard that they are also
> > implementing atomic renaming currently, so maybe that's one of the
> reasons
> > why the development is silent now...
> >
> > For me, I also think deploying hbase on cloud storage is the future, so I
> > would also like to participate here.
> >
> > But I do not think storing the hfile list in meta is the only solution. It
> > will cause cyclic dependencies for hbase:meta, and then force us to have a
> > fallback solution, which makes the code a bit ugly. We should try to see
> > if this could be done with only the FileSystem.
> >
> > Thanks.
> >
> > Andrew Purtell  于2021年5月19日周三 上午8:04写道:
> >
> > > Wellington (et al.),
> > >
> > > S3 is also an important piece of our future production plans.
> > > Unfortunately,  we were unable to assist much with last year's work, on
> > > account of being sidetracked by more immediate concerns. Fortunately,
> > this
> > > renewed interest is timely in that we have an HBase 2 project where, if
> > > this can land in a 2.5 or a 2.6, it could be an important cost to serve
> > > optimization, and one we could and would make use of. Therefore I would
> > > like to restate my employer's interest in this work too. It may just be
> > > Viraj and myself in the early days.
> > >
> > > I'm not sure how best to collaborate. We could review changes from the
> > > original authors, new changes, and/or divide up the development tasks.
> > > We can certainly offer our time for testing, and can afford the costs of
> > > testing against the S3 service.
> > >
> > >
> > > On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <
> > > wellington.chevre...@gmail.com> wrote:
> > >
> > > > Greetings everyone,
> > > >
> > > > HBASE-24749 was proposed almost a year ago, introducing a new
> > > > StoreFile tracker as a way to allow any HBase hfile modifications to
> > > > be safely completed without needing a file system rename. This seems
> > > > pretty relevant for deployments over S3 file systems, where rename
> > > > operations are not atomic and can suffer performance degradation when
> > > > multiple requests get concurrently 

Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-19 Thread Wellington Chevreuil
Thank you, Andrew and Duo,

Talking internally with Josh Elser, the initial idea was to rebase the
feature branch onto master (in order to catch up with the latest commits),
then focus on work to have a minimal functioning HBase: in other words,
together with the already committed work from HBASE-25391, make sure flush,
compactions, splits and merges can all take advantage of the persistent
store file manager and complete with no need to rely on renames. These all
map to the subtasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we can
test and validate that this works well for our goals, we can then focus on
snapshots, bulk loading and tooling.

> S3 now supports strong consistency, and I heard that they are also
> implementing atomic renaming currently, so maybe that's one of the reasons
> why the development is silent now...
>
Interesting, I had no idea this was being implemented. I do know, however,
that a version of this feature is already available in recent EMR releases
(at least since 6.2.0), and the AWS team has published their own blog post
with their results:
https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/

> But I do not think storing the hfile list in meta is the only solution. It
> will cause cyclic dependencies for hbase:meta, and then force us to have a
> fallback solution, which makes the code a bit ugly. We should try to see if
> this could be done with only the FileSystem.
>
This is indeed a relevant concern. One idea I had mentioned in the original
design doc was to track committed/uncommitted files through xattrs (or
tags), which may have its own performance issues, as explained by Stephen
Wu, but is something that could be attempted.
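
As a rough illustration of that xattr idea (purely a sketch with a made-up
attribute name, and only workable on FileSystem implementations that support
extended attributes, e.g. HDFS; on S3 it would presumably map to object tags
instead):

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitMarkerSketch {
  // Hypothetical attribute name, not taken from the actual design doc.
  private static final String COMMITTED_XATTR = "user.hbase.hfile.committed";

  /** Mark an hfile as committed in place, without renaming it. */
  public static void markCommitted(FileSystem fs, Path hfile) throws IOException {
    fs.setXAttr(hfile, COMMITTED_XATTR, new byte[] { 1 });
  }

  /** Check whether an hfile was committed; uncommitted files can be cleaned up. */
  public static boolean isCommitted(FileSystem fs, Path hfile) throws IOException {
    Map<String, byte[]> attrs = fs.getXAttrs(hfile);
    return attrs.containsKey(COMMITTED_XATTR);
  }
}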

Em qua., 19 de mai. de 2021 às 04:56, 张铎(Duo Zhang) 
escreveu:

> S3 now supports strong consistency, and I heard that they are also
> implementing atomic renaming currently, so maybe that's one of the reasons
> why the development is silent now...
>
> For me, I also think deploying hbase on cloud storage is the future, so I
> would also like to participate here.
>
> But I do not think storing the hfile list in meta is the only solution. It
> will cause cyclic dependencies for hbase:meta, and then force us to have a
> fallback solution, which makes the code a bit ugly. We should try to see if
> this could be done with only the FileSystem.
>
> Thanks.
>
> Andrew Purtell  于2021年5月19日周三 上午8:04写道:
>
> > Wellington (et al.),
> >
> > S3 is also an important piece of our future production plans.
> > Unfortunately,  we were unable to assist much with last year's work, on
> > account of being sidetracked by more immediate concerns. Fortunately,
> this
> > renewed interest is timely in that we have an HBase 2 project where, if
> > this can land in a 2.5 or a 2.6, it could be an important cost to serve
> > optimization, and one we could and would make use of. Therefore I would
> > like to restate my employer's interest in this work too. It may just be
> > Viraj and myself in the early days.
> >
> > I'm not sure how best to collaborate. We could review changes from the
> > original authors, new changes, and/or divide up the development tasks. We
> > can certainly offer our time for testing, and can afford the costs of
> > testing against the S3 service.
> >
> >
> > On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <
> > wellington.chevre...@gmail.com> wrote:
> >
> > > Greetings everyone,
> > >
> > > HBASE-24749 was proposed almost a year ago, introducing a new
> > > StoreFile tracker as a way to allow any HBase hfile modifications to be
> > > safely completed without needing a file system rename. This seems
> > > pretty relevant for deployments over S3 file systems, where rename
> > > operations are not atomic and can suffer performance degradation when
> > > multiple requests get concurrently submitted to the same bucket. We have
> > > done preliminary tests and YCSB runs, where individual renames of files
> > > larger than 5GB can take a few hundred seconds to complete. We also
> > > observed impacts on write workload throughput, with the renames
> > > potentially being the bottleneck.
> > >
> > > With S3 being an important piece of my employer’s cloud solution, we
> would
> > > like to help it move forward. We plan to contribute new patches per the
> > > original design/Jira, but we’d also be happy to review changes from the
> > > original authors, too. Please let us know if anyone has any concerns,
> > > otherwise we’ll start to self-assign issues on HBASE-24749
> > >
> > > Wellington
> > >
> >
> >
> > --
> > Best regards,
> > Andrew
> >
> > Words like orphans lost among the crosstalk, meaning torn from truth's
> > decrepit hands
> >- A23, Crosstalk
> >
>


Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-18 Thread Duo Zhang
S3 now supports strong consistency, and I heard that they are also
implementing atomic renaming currently, so maybe that's one of the reasons
why the development is silent now...

For me, I also think deploying HBase on cloud storage is the future, so I
would also like to participate here.

But I do not think storing the hfile list in meta is the only solution. It
will cause cyclic dependencies for hbase:meta, and then force us to have a
fallback solution, which makes the code a bit ugly. We should try to see if
this could be done with only the FileSystem.

Thanks.

Andrew Purtell  于2021年5月19日周三 上午8:04写道:

> Wellington (et al.),
>
> S3 is also an important piece of our future production plans.
> Unfortunately,  we were unable to assist much with last year's work, on
> account of being sidetracked by more immediate concerns. Fortunately, this
> renewed interest is timely in that we have an HBase 2 project where, if
> this can land in a 2.5 or a 2.6, it could be an important cost to serve
> optimization, and one we could and would make use of. Therefore I would
> like to restate my employer's interest in this work too. It may just be
> Viraj and myself in the early days.
>
> I'm not sure how best to collaborate. We could review changes from the
> original authors, new changes, and/or divide up the development tasks. We
> can certainly offer our time for testing, and can afford the costs of
> testing against the S3 service.
>
>
> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <
> wellington.chevre...@gmail.com> wrote:
>
> > Greetings everyone,
> >
> > HBASE-24749 was proposed almost a year ago, introducing a new
> > StoreFile tracker as a way to allow any HBase hfile modifications to be
> > safely completed without needing a file system rename. This seems pretty
> > relevant for deployments over S3 file systems, where rename operations
> > are not atomic and can suffer performance degradation when multiple
> > requests get concurrently submitted to the same bucket. We have done
> > preliminary tests and YCSB runs, where individual renames of files larger
> > than 5GB can take a few hundred seconds to complete. We also observed
> > impacts on write workload throughput, with the renames potentially being
> > the bottleneck.
> >
> > With S3 being an important piece of my employer’s cloud solution, we would
> > like to help it move forward. We plan to contribute new patches per the
> > original design/Jira, but we’d also be happy to review changes from the
> > original authors, too. Please let us know if anyone has any concerns,
> > otherwise we’ll start to self-assign issues on HBASE-24749
> >
> > Wellington
> >
>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>- A23, Crosstalk
>


Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-18 Thread Andrew Purtell
Wellington (et al.),

S3 is also an important piece of our future production plans.
Unfortunately, we were unable to assist much with last year's work, on
account of being sidetracked by more immediate concerns. Fortunately, this
renewed interest is timely in that we have an HBase 2 project where, if
this can land in a 2.5 or a 2.6, it could be an important cost-to-serve
optimization, and one we could and would make use of. Therefore I would
like to restate my employer's interest in this work too. It may just be
Viraj and myself in the early days.

I'm not sure how best to collaborate. We could review changes from the
original authors, new changes, and/or divide up the development tasks. We
can certainly offer our time for testing, and can afford the costs of
testing against the S3 service.


On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <
wellington.chevre...@gmail.com> wrote:

> Greetings everyone,
>
> HBASE-24749 was proposed almost a year ago, introducing a new StoreFile
> tracker as a way to allow any HBase hfile modifications to be safely
> completed without needing a file system rename. This seems pretty relevant
> for deployments over S3 file systems, where rename operations are not
> atomic and can suffer performance degradation when multiple requests get
> concurrently submitted to the same bucket. We have done preliminary tests
> and YCSB runs, where individual renames of files larger than 5GB can take
> a few hundred seconds to complete. We also observed impacts on write
> workload throughput, with the renames potentially being the bottleneck.
>
> With S3 being an important piece of my employer’s cloud solution, we would
> like to help it move forward. We plan to contribute new patches per the
> original design/Jira, but we’d also be happy to review changes from the
> original authors, too. Please let us know if anyone has any concerns,
> otherwise we’ll start to self-assign issues on HBASE-24749
>
> Wellington
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk


[DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

2021-05-18 Thread Wellington Chevreuil
Greetings everyone,

HBASE-24749 was proposed almost a year ago, introducing a new StoreFile
tracker as a way to allow any HBase hfile modifications to be safely
completed without needing a file system rename. This seems pretty relevant
for deployments over S3 file systems, where rename operations are not atomic
and can suffer performance degradation when multiple requests get
concurrently submitted to the same bucket. We have done preliminary tests
and YCSB runs, where individual renames of files larger than 5GB can take a
few hundred seconds to complete. We also observed impacts on write workload
throughput, with the renames potentially being the bottleneck.

With S3 being an important piece of my employer’s cloud solution, we would
like to help it move forward. We plan to contribute new patches per the
original design/Jira, but we’d also be happy to review changes from the
original authors, too. Please let us know if anyone has any concerns;
otherwise we’ll start to self-assign issues on HBASE-24749.

Wellington