[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-21 Thread Adar Dembo (Code Review)
Adar Dembo has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..

fs: defer failure from metadata load to bootstrap when data dir is missing

An upcoming patch adds a CLI tool action to remove data directories. When a
data dir is removed, all tablets with data on it will fail. Today that
failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to make
that a little bit more graceful.

This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a
failure when a data dir is missing. It further changes TabletMetadata to
treat such failures as non-fatal, and adds checks to TabletBootstrap so that
the failures manifest there instead.

No tests in this patch because:
1. Andrew has already merged tablet-level tests for failed disks, and
2. The CLI tool patch adds coverage at the itest-level.

Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Reviewed-on: http://gerrit.cloudera.org:8080/8376
Reviewed-by: Todd Lipcon 
Reviewed-by: Andrew Wong 
Tested-by: Kudu Jenkins
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/ts_tablet_manager.cc
10 files changed, 136 insertions(+), 69 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Andrew Wong: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 10
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-19 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Patch Set 9: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 9
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Sat, 20 Jan 2018 01:14:22 +
Gerrit-HasComments: No


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-19 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Patch Set 9: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc
File src/kudu/tablet/tablet_bootstrap.cc:

http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@557
PS8, Line 557:   // Ensure the tablet's data dirs are present and healthy 
before it is opened.
> The old location was in TSTabletManager::OpenTablet, which is run out of th
k... not in love with it but you're right it seems potentially tricky



--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 9
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Sat, 20 Jan 2018 00:51:13 +
Gerrit-HasComments: Yes


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-19 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins, Andrew Wong, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8376

to look at the new patch set (#9).

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..

fs: defer failure from metadata load to bootstrap when data dir is missing

An upcoming patch adds a CLI tool action to remove data directories. When a
data dir is removed, all tablets with data on it will fail. Today that
failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to make
that a little bit more graceful.

This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a
failure when a data dir is missing. It further changes TabletMetadata to
treat such failures as non-fatal, and adds checks to TabletBootstrap so that
the failures manifest there instead.

No tests in this patch because:
1. Andrew has already merged tablet-level tests for failed disks, and
2. The CLI tool patch adds coverage at the itest-level.

Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/ts_tablet_manager.cc
10 files changed, 136 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/76/8376/9
--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 9
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-18 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc@327
PS8, Line 327: FindOrDie(uuid_by_uuid_idx, uuid_idx)
> is this not just 'uuid' above?
Yeah, not sure what happened here.


http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc@330
PS8, Line 330:   pb->Swap(&group);
> I think now that we are on newish protobuf we could do '*pb = std::move(gro
Done


http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc
File src/kudu/tablet/tablet_bootstrap.cc:

http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@557
PS8, Line 557:   // Ensure the tablet's data dirs are present and healthy 
before it is opened.
> not 100% sure I follow why you defer it all the way until bootstrap time vs
The old location was in TSTabletManager::OpenTablet, which is run out of the 
bootstrap threadpool too. If you're suggesting I do it  in 
TabletMetadata::LoadFromSuperBlock (called synchronously for each tablet in 
TSTabletManager::Init), it'd require more plumbing.

Why did I move it at all? My preference is for TSTabletManager to hold 
tserver-specific behavior only, and while the "disk failure handling" logic 
isn't particularly useful for the master, it could be in the future, so 
locating it in TabletBootstrap is an easy way to ensure it makes it into every 
place that uses a tablet. Tablet::Open would be equally reasonable, I think.


http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@561
PS8, Line 561: error retrieving tablet data dir group
> will this be easy enough for an operator to understand? perhaps we should s
Done


http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@563
PS8, Line 563: return Status::IOError("tablet data is in a failed 
directory");
> perhaps more accurate to say that _some_ tablet data is in a failed directo
I'll clarify with 'some', but if you follow the return chain up the stack 
you'll see that TSTabletManager::OpenTablet prefixes the message with the 
tablet ID before it logs it.



--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 8
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 18 Jan 2018 22:57:43 +
Gerrit-HasComments: Yes


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-18 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc@327
PS8, Line 327: FindOrDie(uuid_by_uuid_idx, uuid_idx)
is this not just 'uuid' above?


http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc@330
PS8, Line 330:   pb->Swap(&group);
I think now that we are on newish protobuf we could do '*pb = 
std::move(group);' which reads a bit better


http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc
File src/kudu/tablet/tablet_bootstrap.cc:

http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@557
PS8, Line 557:   // Ensure the tablet's data dirs are present and healthy 
before it is opened.
not 100% sure I follow why you defer it all the way until bootstrap time vs 
marking it FAILED in TSTabletManager::Init immediately after it opens the 
TabletMeta (the old behavior)? Seems that way we'd have the correct failed 
status for missing tablets as early as possible vs the possibility that those 
missing tablets end up waiting on the bootstrap threadpool queue for many 
minutes behind non-failed tablets doing their log replay


http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@561
PS8, Line 561: error retrieving tablet data dir group
will this be easy enough for an operator to understand? perhaps we should say 
something explanatory like "(data directory likely removed)" or somesuch?


http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@563
PS8, Line 563: return Status::IOError("tablet data is in a failed 
directory");
perhaps more accurate to say that _some_ tablet data is in a failed directory?

Do you think we could include the uuid in the message for shits and giggles?



--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 8
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 18 Jan 2018 18:51:48 +
Gerrit-HasComments: Yes


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-17 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins, Andrew Wong, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8376

to look at the new patch set (#8).

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..

fs: defer failure from metadata load to bootstrap when data dir is missing

An upcoming patch adds a CLI tool action to remove data directories. When a
data dir is removed, all tablets with data on it will fail. Today that
failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to make
that a little bit more graceful.

This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a
failure when a data dir is missing. It further changes TabletMetadata to
treat such failures as non-fatal, and adds checks to TabletBootstrap so that
the failures manifest there instead.

No tests in this patch because:
1. Andrew has already merged tablet-level tests for failed disks, and
2. The CLI tool patch adds coverage at the itest-level.

Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/ts_tablet_manager.cc
10 files changed, 136 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/76/8376/8
--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 8
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-17 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG@9
PS6, Line 9: remove data di
> nit: just remove at this point
Done


http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG@20
PS6, Line 20: has already merged
> nit: has already worked
Done



--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 7
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 17 Jan 2018 22:32:09 +
Gerrit-HasComments: Yes


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-17 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins, Andrew Wong, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8376

to look at the new patch set (#7).

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..

fs: defer failure from metadata load to bootstrap when data dir is missing

An upcoming patch adds CLI tool actions to remove data directories. When a
data dir is removed, all tablets with data on it will fail. Today that
failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to make
that a little bit more graceful.

This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a
failure when a data dir is missing. It further changes TabletMetadata to
treat such failures as non-fatal, and adds checks to TabletBootstrap so that
the failures manifest there instead.

No tests in this patch because:
1. Andrew has already merged tablet-level tests for failed disks, and
2. The CLI tool patch adds coverage at the itest-level.

Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/ts_tablet_manager.cc
10 files changed, 136 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/76/8376/7
--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 7
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-17 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Patch Set 6: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG@9
PS6, Line 9: add and remove
nit: just remove at this point


http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG@20
PS6, Line 20: is already working
nit: has already worked



--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 6
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 17 Jan 2018 19:38:31 +
Gerrit-HasComments: Yes


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-12 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Patch Set 5: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 5
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Sat, 13 Jan 2018 03:28:38 +
Gerrit-HasComments: No


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-09 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins, Andrew Wong, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8376

to look at the new patch set (#3).

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..

fs: defer failure from metadata load to bootstrap when data dir is missing

An upcoming patch adds CLI tool actions to add and remove data directories.
When a data dir is removed, all tablets with data on it will fail. Today
that failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to
make that a little bit more graceful.

This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a
failure when a data dir is missing. It further changes TabletMetadata to
treat such failures as non-fatal, and adds checks to TabletBootstrap so that
the failures manifest there instead.

No tests in this patch because:
1. Andrew is already working on tablet-level tests for failed disks, and
2. The CLI tool patch adds coverage at the itest-level.

Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/ts_tablet_manager.cc
10 files changed, 136 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/76/8376/3
--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 3
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-09 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@81
PS2, Line 81:   Status LoadFromPB(const UuidIndexByUuidMap& uuid_idx_by_uuid,
> worth short docs explaining the cases of bad-status here.
Done


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@84
PS2, Line 84:   Status CopyToPB(const UuidByUuidIndexMap& uuid_by_uuid_idx,
> and here
Done


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@276
PS2, Line 276: data dir
 :   // is missing.
> only looked at the headers so far (not the impl) but this left me a little
Given my response to your other comment (in the impl), is this still relevant?


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@318
PS2, Line 318: if (!FindCopy(uuid_by_uuid_idx, uuid_idx, &uuid)) {
 :   return Status::NotFound(Substitute(
 :   "could not find data dir with uuid index $0", 
uuid_idx));
> would this not be a programmer error? how would you end up with a uuid_idx
It is, but I thought the symmetry with LoadFromPB was worthwhile because it 
makes the class' behavior more predictable.


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@831
PS2, Line 831:   RETURN_NOT_OK(group->CopyToPB(uuid_by_idx_, pb));
> I guess my confusion in the header was wrong, but per my comments above, I
Indeed, but since I want CopyToPB to be symmetric with LoadFromPB, a 
RETURN_NOT_OK here seemed more appropriate.


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc
File src/kudu/tablet/tablet_metadata.cc:

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc@440
PS2, Line 440: DeleteOrphanedBlocks(orphaned_blocks);
> pondering this a bit more, I wonder whether we need to actually start stori
This was a legitimate issue that Andrew has since fixed in commit 
47b81c452194e75da7fd966f07766de4bdcdeab0.



--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 09 Jan 2018 20:02:40 +
Gerrit-HasComments: Yes


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2018-01-09 Thread Adar Dembo (Code Review)
Adar Dembo has restored this change. ( http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Restored
--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: restore
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2017-10-26 Thread Adar Dembo (Code Review)
Adar Dembo has abandoned this change. ( http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Abandoned

It's too difficult to do this safely, due to the various ways in which block 
IDs can be reused. KUDU-2202 links to this patch and discusses the problem 
further.
--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2017-10-26 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc
File src/kudu/tablet/tablet_metadata.cc:

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc@440
PS2, Line 440: DeleteOrphanedBlocks(orphaned_blocks);
> I think we discussed this elsewhere, but I"m concerned about this interacti
pondering this a bit more, I wonder whether we need to actually start storing a 
separate "orphaned blocks" list for each data dir -- that way when we delete 
the failed tablet we can safely just delete the blocks that are known to be on 
the remaining disks?

Alternatively we could feed the orphaned-blocks list from each tablet into the 
LBM at startup time to ensure that we don't reuse block IDs, but that's getting 
into relatively complex layer-crossing territory



--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 27 Oct 2017 00:08:46 +
Gerrit-HasComments: Yes


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2017-10-26 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@81
PS2, Line 81:   Status LoadFromPB(const UuidIndexByUuidMap& uuid_idx_by_uuid,
worth short docs explaining the cases of bad-status here.


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@84
PS2, Line 84:   Status CopyToPB(const UuidByUuidIndexMap& uuid_by_uuid_idx,
and here


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@276
PS2, Line 276: data dir
 :   // is missing.
only looked at the headers so far (not the impl) but this left me a little 
confused.

This seems to indicate that the above LoadDataDirGroupFromPB actually has a 
side effect even if it returns a bad status that indicates a missing dir? Does 
this return a 'pb' even with a bad status?


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@318
PS2, Line 318: if (!FindCopy(uuid_by_uuid_idx, uuid_idx, &uuid)) {
 :   return Status::NotFound(Substitute(
 :   "could not find data dir with uuid index $0", 
uuid_idx));
would this not be a programmer error? how would you end up with a uuid_idx in a 
DataDirGroup which doesn't point to a valid entry in the UUIDs array? I would 
have expected a FindOrDie or LOG(FATAL) here


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@831
PS2, Line 831:   RETURN_NOT_OK(group->CopyToPB(uuid_by_idx_, pb));
I guess my confusion in the header was wrong, but per my comments above, I 
think any failure of this code would be indicative of a bug, no?


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc
File src/kudu/tablet/tablet_metadata.cc:

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc@440
PS2, Line 440: DeleteOrphanedBlocks(orphaned_blocks);
I think we discussed this elsewhere, but I"m concerned about this interacting 
with blockid re-use. If we have the following scenario:

- block 1-5 on disk A
- block 6-10 on disk B

tablet has orphaned block #6 in its metadata

disk B is failed or removed, and we restart the tserver

- LBM sees '5' as max_block_id
- allocates a new block 6 on disk A
- we open the tablet, and it decides to delete block 6
-- incorrectly deletes the new block 6 instead of the old block 6

ie I think it's important not to go to proceed on a tablet when we have it 
pointing to a drive that has been removed.



--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 26 Oct 2017 20:58:36 +
Gerrit-HasComments: Yes


[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing

2017-10-24 Thread Adar Dembo (Code Review)
Adar Dembo has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/8376


Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
..

fs: defer failure from metadata load to bootstrap when data dir is missing

An upcoming patch adds CLI tool actions to add and remove data directories.
When a data dir is removed, all tablets with data on it will fail. Today
that failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to
make that a little bit more graceful.

This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a
failure when a data dir is missing. It further changes TabletMetadata to
treat such failures as non-fatal, and adds checks to TabletBootstrap so that
the failures manifest there instead.

No tests in this patch because:
1. Andrew is already working on tablet-level tests for failed disks, and
2. The CLI tool patch adds coverage at the itest-level.

Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/error_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/ts_tablet_manager.cc
11 files changed, 128 insertions(+), 70 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/76/8376/1
--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo