[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2013-04-30 Thread Himanshu Vashishtha (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13645599#comment-13645599
 ] 

Himanshu Vashishtha commented on HBASE-6774:


Hey Enis,

Thanks for asking these questions.

The attached doc describes a per-regionserver *max_completeSequenceId* field, 
which is updated whenever the master receives a heartbeat from that 
regionserver. When the master processes the server shutdown event, it uses the 
regionserver's max_completeSequenceId to determine how much of the WAL it has 
missed and needs to read before finalizing allWALEntriesFlushed. The goal is to 
process all WALEdits which have walEdit#key#logSequenceId > 
max_completeSequenceId. If that means reading the second-to-last WAL as well, 
it will process that too. The invariant is to read the latest WAL files first, 
until we reach a WAL in which some WALEdits have walEdit#key#logSequenceId < 
max_completeSequenceId. We no longer need to read older WALs then. 
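
The newest-first scan described above can be sketched roughly as follows. This 
is an illustrative Python model, not HBase code: WalFile, select_wals_to_read, 
and the in-memory lists of sequence ids are hypothetical stand-ins for real WAL 
files and the existing reader. It also assumes sequence ids are ascending 
within a file and across files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WalFile:
    """Hypothetical in-memory stand-in for one WAL file."""
    name: str
    seq_ids: List[int]  # logSequenceIds in write order (ascending)


def select_wals_to_read(wals_oldest_first: List[WalFile],
                        max_complete_seq_id: int) -> List[WalFile]:
    """Walk the WALs newest-first and return the files that have to be
    inspected. Stop after the first file whose first id is at or below
    max_complete_seq_id, since every older file only holds smaller ids."""
    selected = []
    for wal in reversed(wals_oldest_first):
        selected.append(wal)
        if wal.seq_ids and wal.seq_ids[0] <= max_complete_seq_id:
            break  # this file straddles the boundary; older files are not needed
    return selected


wals = [WalFile("f3", [10, 50]), WalFile("f2", [60, 120]), WalFile("f1", [130, 200])]
print([w.name for w in select_wals_to_read(wals, 100)])  # ['f1', 'f2']
```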

bq. If a region has not got any update for some time, its 
latestCompleteFlushSeqId won't be updated at all, since there will be no 
flushes. To reassign this region, we have to ensure that all wals are read. 

It uses max_completeSequenceId to read the remaining WAL. Once it has read all 
the WALEdits after max_completeSequenceId, allWALEntriesFlushed will have the 
correct information, and it can be used to decide whether a region can be 
assigned or not. 


bq. The only reliable way is to read up the wal backwards, 
I am not sure whether a sequenceFile can be read backwards, or how efficient it 
would be. That's why I propose to read a WAL file from its head and re-use the 
existing WALReader code.

As soon as any region is flushed, the master will have the most up-to-date 
information for all regions on that regionserver once it receives the next 
heartbeat.

Consider a rogue scenario: a regionserver sends a report with 
max_completeSequenceId = 100. There is a write-heavy workload, the WAL is 
rolled, and then the server aborts. The master misses every heartbeat between 
that report and the abort. Based on max_completeSequenceId, we need to read the 
last 2 WAL files (1 + 1): the new one, and the one that was current when the 
master got the last heartbeat (it has some entries > 100). Since we read the 
most recent files first, it is easy to determine whether we need to read older 
WALs or not. Let's call those files f1 and f2, where f1 is the latest. 
The master reads f1 first and sees that the first waledit#key#logSequenceId > 
100, so it enqueues f2 as well, since there might be some entries at f2's tail 
which were missed.
Once it has read f1 and f2 and updated allWALEntriesFlushed for the regions, 
the master can decide which regions can be assigned right away.
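
To make the f1/f2 walk-through concrete, here is a small Python sketch. The 
region names, the individual sequence ids, and the function 
regions_assignable_now are made up for illustration; only 
max_completeSequenceId = 100 comes from the scenario above.

```python
from typing import Dict, Iterable, List, Tuple

# One WAL entry is modelled as (logSequenceId, region name).
Entry = Tuple[int, str]


def regions_assignable_now(all_flushed: Dict[str, bool],
                           wal_files_newest_first: Iterable[List[Entry]],
                           max_complete_seq_id: int) -> List[str]:
    """Clear the allWALEntriesFlushed flag for every region that has an
    edit newer than max_complete_seq_id, then return the regions whose
    flag is still set."""
    flags = dict(all_flushed)  # do not mutate the caller's view
    for entries in wal_files_newest_first:
        for seq_id, region in entries:
            if seq_id > max_complete_seq_id:
                flags[region] = False  # unflushed edit: wait for log splitting
    return sorted(region for region, ok in flags.items() if ok)


# State from the last heartbeat (max_completeSequenceId = 100).
last_report = {"r1": True, "r2": True, "r3": True}
f2 = [(95, "r1"), (101, "r2"), (110, "r2")]   # file the last heartbeat fell into
f1 = [(111, "r2"), (112, "r3")]               # file created after the WAL roll
print(regions_assignable_now(last_report, [f1, f2], 100))  # ['r1']
```

Only r1 keeps its flag, because its single entry (95) was already covered by 
the last report; r2 and r3 have to wait for log splitting.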

Hope this helps.


[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2013-04-30 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13645829#comment-13645829
 ] 

Enis Soztutar commented on HBASE-6774:
--

Thanks for the explanation. It seems that this can work, but the relative gain 
may not be enough to justify it. The other proposal, writing the list of region 
names in the WAL header and reading them to determine which tasks should be 
complete before assignment, seems cleaner to me. 
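
A rough sketch of the header-based alternative mentioned here, assuming each 
WAL header carries the set of region names that appear in that file. The header 
layout and the names split_tasks_blocking and wal_headers are hypothetical; the 
thread does not specify them.

```python
from typing import Dict, Set


def split_tasks_blocking(region: str,
                         wal_headers: Dict[str, Set[str]],
                         completed_tasks: Set[str]) -> Set[str]:
    """Return the WAL split tasks that must still finish before `region`
    can be assigned. wal_headers maps a WAL file name to the regions
    listed in its (assumed) header."""
    needed = {wal for wal, regions in wal_headers.items() if region in regions}
    return needed - completed_tasks


wal_headers = {
    "wal.1": {"r1", "r2"},
    "wal.2": {"r2"},
    "wal.3": {"r2", "r3"},
}
done = {"wal.1"}
print(split_tasks_blocking("r1", wal_headers, done))          # set()
print(sorted(split_tasks_blocking("r2", wal_headers, done)))  # ['wal.2', 'wal.3']
print(split_tasks_blocking("r4", wal_headers, done))          # set()
```

An empty result means nothing blocks the region: r1 because its only WAL is 
already split, r4 because no header lists it at all.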

 Immediate assignment of regions that don't have entries in HLog
 ---

 Key: HBASE-6774
 URL: https://issues.apache.org/jira/browse/HBASE-6774
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.95.2
Reporter: Nicolas Liochon
Assignee: Himanshu Vashishtha
 Attachments: HBase-6774-approach.pdf


 Today, after a failure is detected, the algorithm is:
 - split the logs
 - when all the logs are split, assign the regions
 But some regions can have no entries at all in the HLog. There are many 
 reasons for this:
 - reference or historical tables: bulk written occasionally, then read 
 only.
 - sequential rowkeys. In this case, most of the regions will be read only, 
 but they can be on a regionserver with a lot of writes.
 - tables flushed often for safety reasons. I'm thinking about meta here.
 For meta, we can imagine flushing very often. Hence, the recovery time for 
 meta will, in many cases, be the failure detection time.
 There are different possible algorithms:
 Option 1)
  A new task is added, in parallel with the split. This task reads all the 
 HLogs. If there is no entry for a region, this region is assigned.
  Pro: simple
  Cons: We will need to read all the files. Adds a read.
 Option 2)
  The master writes in ZK the number of log files, per region.
  When the regionserver starts the split, it reads the full block (64M) and 
 decreases the log file counter of the region. If it reaches 0, the assignment 
 starts. At the end of its split, the region server decreases the counter as 
 well. This allows the assignment to start even if not all the HLogs are 
 finished. It would also make some regions available even if we have an issue 
 in one of the log files.
  Pro: parallel
  Cons: adds work for the region server. Requires reading the whole 
 file before starting to write. 
 Option 3)
  Add some metadata at the end of the log file. The last log file won't have 
 metadata, since if we are recovering, it's because the server crashed. But the 
 others will. And the last log file should be smaller (half a block on average).  
 Option 4) Still some metadata, but in a different file. Cons: writes are 
 increased (but not that much, we just need to write the region once). Pros: 
 if we lose the HLog files (major failure, no replica available) we can still 
 continue with the regions that were not written at this stage.
 I think it should be done, even if none of the algorithms above is totally 
 convincing yet. It's linked as well to locality and short circuit reads: with 
 these two points, reading the file twice becomes much less of an issue, for 
 example. My current preference would be to open the file twice in the region 
 server: once for splitting, as today, and once for a quick read looking for 
 unused regions. Who knows, maybe it would even be faster this way, as the quick 
 read thread would warm up the different caches for the splitting thread.
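
 One possible reading of Option 2, sketched in Python. This is only an 
 interpretation of the description above: the class RegionLogCounters and its 
 methods are hypothetical, and a plain dict stands in for the per-region 
 counters that would live in ZK.

```python
from typing import Dict, List, Set


class RegionLogCounters:
    """In-memory stand-in for the per-region counters kept in ZooKeeper."""

    def __init__(self, regions: List[str], num_log_files: int) -> None:
        self.remaining: Dict[str, int] = {r: num_log_files for r in regions}
        self.assignable: List[str] = []

    def _decrement(self, region: str) -> None:
        self.remaining[region] -= 1
        if self.remaining[region] == 0:
            self.assignable.append(region)  # no log file can still hold its edits

    def log_scanned(self, regions_in_log: Set[str]) -> None:
        """Called once a splitter has read a whole log file: regions with
        no entries in it are released for that file immediately."""
        for region in self.remaining:
            if region not in regions_in_log:
                self._decrement(region)

    def log_split_done(self, regions_in_log: Set[str]) -> None:
        """Called when the split of that file has finished: regions that
        did have entries in it are released for that file now."""
        for region in regions_in_log:
            if region in self.remaining:
                self._decrement(region)


counters = RegionLogCounters(["r1", "r2"], num_log_files=2)
counters.log_scanned({"r2"})      # log A holds edits for r2 only
counters.log_scanned(set())       # log B holds no edits for r1 or r2
print(counters.assignable)        # ['r1']
counters.log_split_done({"r2"})   # log A fully split
print(counters.assignable)        # ['r1', 'r2']
```

 Here r1 becomes assignable as soon as both files have been scanned, while r2 
 has to wait until the split of the one file that contains its edits is done.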

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2013-04-30 Thread Himanshu Vashishtha (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13646005#comment-13646005
 ] 

Himanshu Vashishtha commented on HBASE-6774:


Thanks Enis.

Yes, the WAL approach is also there, but I think they both have their own plus 
and minus points. I proposed the ServerLoad approach because it is 
self-contained, doesn't involve any changes in WAL/SequenceFile, etc., and 
re-uses the existing ServerLoad object. 

In the WAL metadata case, some metadata should be appended at the end of a WAL 
file. This involves adding a custom key-value while closing the WAL file, and a 
check while reading every record (whether it is a meta record or not, etc.).
Since it will be added at the end, the master needs to open the reader and seek 
to the end of the file. This metadata should be read for all the log files, 
sequentially, starting from the oldest WAL file, in order to track a region 
timeline. This is in addition to reading the last WAL file.
For an application with high write rates, a regionserver may have a larger 
number of WALs to replay.
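
As a sketch of what tracking a region timeline from WAL metadata could look 
like: the Python below assumes, purely for illustration, that each closed WAL 
ends with a trailer mapping region name to the highest logSequenceId written 
for that region in the file. The thread does not define the metadata format, 
and latest_edit_per_region and has_unflushed_edits are hypothetical names.

```python
from typing import Dict, Iterable, List, Tuple


def latest_edit_per_region(trailers_oldest_first: Iterable[Dict[str, int]],
                           last_wal_entries: List[Tuple[int, str]]) -> Dict[str, int]:
    """Build region -> highest logSequenceId seen, from the (assumed)
    trailers of the closed WALs, oldest first, plus a full read of the
    last WAL, which has no trailer because the server crashed."""
    latest: Dict[str, int] = {}
    for trailer in trailers_oldest_first:
        for region, seq_id in trailer.items():
            latest[region] = max(seq_id, latest.get(region, seq_id))
    for seq_id, region in last_wal_entries:
        latest[region] = max(seq_id, latest.get(region, seq_id))
    return latest


def has_unflushed_edits(region: str, latest: Dict[str, int],
                        flushed_seq_id: Dict[str, int]) -> bool:
    """A region needs log replay if its newest edit is past its last flush."""
    return latest.get(region, -1) > flushed_seq_id.get(region, -1)


trailers = [{"r1": 40, "r2": 55}, {"r2": 90}]
last_wal = [(120, "r3")]
flushed = {"r1": 40, "r2": 70, "r3": 100}
latest = latest_edit_per_region(trailers, last_wal)
print(latest)  # {'r1': 40, 'r2': 90, 'r3': 120}
print([r for r in sorted(flushed) if not has_unflushed_edits(r, latest, flushed)])  # ['r1']
```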

Another point is, IMHO, this feature should be made configurable, as there 
might be some workloads which do not require it (writes distributed over the 
whole key-space, etc.). With the WAL approach, it becomes a bit tricky to make 
this feature optional, as it inserts metadata in the WAL. With meta entries in 
a WAL file, every LogReader must always be aware of such entries, be it 
ReplicationLogReaders or LogSplitter, as they might be reading some old logs, 
etc.

bq. It seems that this can work, but the relative gain may not be that much to 
justify it.
This is just an alternative approach to the WAL one, and I think it is less 
intrusive. But I am open to both and would like to hear more of your opinions 
on the above points. 



[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2013-04-29 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13645213#comment-13645213
 ] 

Enis Soztutar commented on HBASE-6774:
--

If I understand this correctly, allWALEntriesFlushed does not seem to contain 
reliable information. With this proposal, it seems that we have to read the 
last WAL files to update allWALEntriesFlushed, to make up for the fact that 
the last heartbeat might not have completed, etc. But heartbeats themselves are 
not reliable either. We cannot assume that by just reading the last WAL file, 
allWALEntriesFlushed will be correct, since we might have been missing 
heartbeats for some time. The only reliable way is to read up the wal 
backwards, until for each region we make sure that we have read up to 
latestCompleteFlushSeqId. Which makes allWALEntriesFlushed redundant. 
If a region has not got any update for some time, its latestCompleteFlushSeqId 
won't be updated at all, since there will be no flushes. To reassign this 
region, we have to ensure that all wals are read. 
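
The concern can be illustrated with a small Python model of a backward scan. 
This is one interpretation of the argument above, not HBase code: the function 
name, the (logSequenceId, region) tuples, and the example ids are all 
hypothetical.

```python
from typing import Dict, List, Tuple


def entries_read_until_settled(entries_newest_first: List[Tuple[int, str]],
                               latest_complete_flush_seq_id: Dict[str, int]) -> int:
    """Count how many WAL entries a backward scan has to read before every
    region has been seen at or below its latestCompleteFlushSeqId."""
    unsettled = set(latest_complete_flush_seq_id)
    read = 0
    for seq_id, region in entries_newest_first:
        if not unsettled:
            break
        read += 1
        if region in unsettled and seq_id <= latest_complete_flush_seq_id[region]:
            unsettled.discard(region)
    return read


wal = [(9, "r1"), (8, "r1"), (7, "r1"), (3, "r1"), (2, "r2"), (1, "r1")]
print(entries_read_until_settled(wal, {"r1": 8}))           # 2
print(entries_read_until_settled(wal, {"r1": 8, "r2": 2}))  # 5
print(entries_read_until_settled(wal, {"r1": 8, "r3": 0}))  # 6
```

In the last call r3 has no entries at all, so the scan never settles it and 
ends up reading every entry, which is the idle-region case raised above.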






[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2013-04-28 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644252#comment-13644252
 ] 

ramkrishna.s.vasudevan commented on HBASE-6774:
---

{code}
This is to cover all the regions that were updated after the last ServerLoad 
report and shutdown event. Once it has read the WAL files (usually only the 
last one) which have sequenceIds greater than max_completeSequenceId, it sends 
an open request for regions which has allWALEntriesFlushed set to true, as they 
don’t need to wait for log splitting/replaying to complete
{code}
I am not clear on this part. I may be missing something in my understanding; 
please correct me if I am wrong.
Say I have allWALEntriesFlushed set to true, but the region has some additional 
WAL entries in the HLog just before the report and abrupt shutdown event 
happened. What do you mean when you say they don't need to wait for Log 
Splitting? 
Also, did you see Jeffrey's latest work on Log Splitting? His proposal also 
uses the LatestCompleteFlushSeqId.  
Thanks for the write-up.




[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2013-04-28 Thread Himanshu Vashishtha (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644274#comment-13644274
 ] 

Himanshu Vashishtha commented on HBASE-6774:


Hey Ram,
 
Thanks for reading it through.

bq.  When you say they don't need to wait for Log Splitting? 
So in the case where there are some mutations between the last server report 
and the shutdown event, we need to look at the last WAL. Once we have read it 
and updated the region:allWalEntriesFlushed mapping for WALEdits which have 
logSeqNum > max_completeSequenceId, we can open those regions which have 
allWalEntriesFlushed still set to true.

Yes, I have looked at Jeffrey's work and will review it more this week. I 
don't think he proposed any change in the usage of latestCompleteFlushSeqId, 
though. Also, that jira will make regions available for writes, and this is 
about making pristine regions available for reads.



[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2013-04-18 Thread Himanshu Vashishtha (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635351#comment-13635351
 ] 

Himanshu Vashishtha commented on HBASE-6774:


[~nkeywal] [~devaraj]: I am interested to know whether there is any progress on 
this issue (making regions available which do not have a WAL entry, i.e., not 
waiting for log splitting to finish). I ran into this when working on a 
read-intensive workload. As Nkeywal commented earlier, it is quite useful for 
some use-cases. There is already a separate WAL for .META., thanks to Devaraj. 
If you guys are OK, I would like to work on this.




[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2013-04-18 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635395#comment-13635395
 ] 

Devaraj Das commented on HBASE-6774:


I am fine with that, [~v.himanshu]. I guess we should start with a proposal 
and agree on it (this jira had multiple proposals).

 Immediate assignment of regions that don't have entries in HLog
 ---

 Key: HBASE-6774
 URL: https://issues.apache.org/jira/browse/HBASE-6774
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.95.2
Reporter: Nicolas Liochon
Assignee: Himanshu Vashishtha

 The algorithm today, after a failure is detected, is:
 - split the logs
 - when all the logs are split, assign the regions
 But some regions can have no entries at all in the HLog. There are many 
 reasons for this:
 - reference or historical tables: bulk written occasionally, then read 
 only.
 - sequential rowkeys. In this case, most of the regions will be read only, 
 but they can be on a regionserver with a lot of writes.
 - tables flushed often for safety reasons. I'm thinking about meta here.
 For meta, we can imagine flushing very often. Hence, the recovery time for 
 meta, in many cases, will be the failure detection time.
 There are different possible algorithms:
 Option 1)
  A new task is added, in parallel with the split. This task reads all the 
 HLogs. If there is no entry for a region, this region is assigned.
  Pro: simple
  Cons: We will need to read all the files, which adds a read.
 Option 2)
  The master writes in ZK the number of log files, per region.
  When the regionserver starts the split, it reads the full block (64M) and 
 decreases the log file counter of the region. If it reaches 0, the assignment 
 starts. At the end of its split, the region server decreases the counter as 
 well. This allows the assignment to start even if not all the HLogs are 
 finished. It would make some regions available even if we have an issue in 
 one of the log files.
  Pro: parallel
  Cons: adds work for the region server. Requires reading the whole 
 file before starting to write. 
 Option 3)
  Add some metadata at the end of the log file. The last log file won't have 
 metadata because, if we are recovering, the server crashed. But the 
 others will. And the last log file should be smaller (half a block on average).  
 Option 4) Still some metadata, but in a different file. Cons: writes are 
 increased (but not that much, we just need to write the region once). Pros: 
 if we lose the HLog files (major failure, no replica available) we can still 
 continue with the regions that were not written at this stage.
 I think it should be done, even if none of the algorithms above is totally 
 convincing yet. It's linked as well to locality and short circuit reads: with 
 these two points, reading the file twice becomes much less of an issue, for 
 example. My current preference would be to open the file twice in the region 
 server, once for splitting as of today, once for a quick read looking for 
 unused regions. Who knows, maybe it would even be faster this way: the quick 
 read thread would warm up the different caches for the splitting thread.
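
One possible reading of Option 2 in the description above, as a minimal single-threaded Java sketch. The class `LogFileCounters` and its methods are hypothetical and a plain map stands in for the counters the master would write to ZK. It assumes a region's counter drops once per log file: right after the first full read for files that hold no entry for the region, and at the end of the split for files that do.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-region countdown of log files still to be dealt with. */
class LogFileCounters {
    // Stand-in for the counters the master would write to ZK, one per region.
    private final Map<String, Integer> remaining = new HashMap<>();
    private final List<String> assigned = new ArrayList<>();

    LogFileCounters(Collection<String> regions, int logFileCount) {
        for (String region : regions) {
            remaining.put(region, logFileCount);
        }
    }

    /** After fully reading one log file: regions absent from it are done with that file. */
    void onLogFileRead(Set<String> regionsInFile) {
        for (String region : new ArrayList<>(remaining.keySet())) {
            if (!regionsInFile.contains(region)) {
                decrement(region);
            }
        }
    }

    /** After splitting one log file: regions present in it are now done with that file. */
    void onLogFileSplit(Set<String> regionsInFile) {
        for (String region : regionsInFile) {
            if (remaining.containsKey(region)) {
                decrement(region);
            }
        }
    }

    private void decrement(String region) {
        int left = remaining.merge(region, -1, Integer::sum);
        if (left == 0) {
            assigned.add(region); // counter reached 0: the assignment can start
        }
    }

    /** Regions whose counter reached 0, in the order they became assignable. */
    List<String> assignedRegions() {
        return new ArrayList<>(assigned);
    }
}
```

With two log files and a region that appears in neither, that region becomes assignable after the two reads, before any split has finished.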



[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2013-04-18 Thread Nicolas Liochon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635384#comment-13635384
 ] 

Nicolas Liochon commented on HBASE-6774:


OK for me, of course :-). Thanks for this. I don't have an ideal solution in 
mind; I guess there is some design work to do here, but maybe Devaraj is 
further along than I am. I'm assigning the jira to you in case you don't have 
the ar for this.

 Immediate assignment of regions that don't have entries in HLog
 ---

 Key: HBASE-6774
 URL: https://issues.apache.org/jira/browse/HBASE-6774
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.95.2
Reporter: Nicolas Liochon




[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2013-04-18 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635427#comment-13635427
 ] 

Lars Hofhansl commented on HBASE-6774:
--

This overlaps (somewhat at least) with HBASE-8375 that I just opened. One of the 
options proposed there is unlogged tables (tables that never write WAL 
entries). All regions of those tables could be assigned immediately.


 Immediate assignment of regions that don't have entries in HLog
 ---

 Key: HBASE-6774
 URL: https://issues.apache.org/jira/browse/HBASE-6774
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.95.2
Reporter: Nicolas Liochon
Assignee: Himanshu Vashishtha




[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2012-11-22 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13502845#comment-13502845
 ] 

nkeywal commented on HBASE-6774:


For the master-based solution:
If we go for the regionserver -> master -> zookeeper solution, it's not 
perfect imho, because we just add an agent in the middle.

The master could store the region information, without going to ZK:
- Faster than the solution with ZK, because we would not write to the disk
- If we lose the master, we lose the data, but it's not an issue (just that 
the recovery will be slower: we will have to read all the logs)
- The master becomes an element of the write path (for the first write in a 
memstore). I'm not at ease with that.

At the end of the day, I agree with what Stack said previously: let's not add a 
new component in the write path. This is valid for both the master and ZK.

So we're left with the other options:
- a specific WAL for .meta.
- adding metadata at the end of the WAL.

I'm currently looking at them.

 Immediate assignment of regions that don't have entries in HLog
 ---

 Key: HBASE-6774
 URL: https://issues.apache.org/jira/browse/HBASE-6774
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal




[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2012-11-22 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13502985#comment-13502985
 ] 

Devaraj Das commented on HBASE-6774:


I am starting to prototype the specific WAL for .meta. approach (leveraging the 
implementation of FSHLog) to get a feel for the complexity, etc. Will keep 
folks posted (and probably raise a separate jira as well).

 Immediate assignment of regions that don't have entries in HLog
 ---

 Key: HBASE-6774
 URL: https://issues.apache.org/jira/browse/HBASE-6774
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal




[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2012-11-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13502132#comment-13502132
 ] 

nkeywal commented on HBASE-6774:


After thinking again about this one, here is another possible solution:
- put the memstore state in ZooKeeper
- when we create a new memstore, we asynchronously write the state in ZK 
(region with empty memstore and region server name)
- when the first put is written in the WAL, we synchronously write to ZK that 
this region now has a non-empty memstore.
- then the other puts don't need any ZK writes or synchronisation
- on memstore flush, we asynchronously update the state in ZK to empty memstore 
region.
- on crash, the master checks the region memstore states. If a region is 
assigned but its memstore is empty, we can reassign the region immediately. If 
there is no data in ZK, or this data says the memstore is not empty, the master 
does nothing.

This is high level; I obviously need to tune it for the multiple-memstore case 
and study all error cases. But it seems doable.
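
To make the state transitions above concrete, here is a minimal Java sketch. It is only an illustration under assumptions: `MemstoreStateTracker` and its method names are hypothetical, and an in-memory map stands in for the znodes, so no real ZooKeeper client call is shown and the async/sync distinction appears only in comments.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for the per-region znodes; hypothetical, not a real ZK client. */
class MemstoreStateTracker {
    enum State { EMPTY, NON_EMPTY }

    // Stand-in for ZooKeeper: one entry per region.
    private final Map<String, State> znodes = new ConcurrentHashMap<>();
    // Region-server-local memory of the regions already marked NON_EMPTY.
    private final Set<String> locallyNonEmpty = ConcurrentHashMap.newKeySet();

    /** Memstore created or flushed (an asynchronous write in the proposal). */
    void onMemstoreCreatedOrFlushed(String region) {
        locallyNonEmpty.remove(region);
        znodes.put(region, State.EMPTY);
    }

    /** Returns true only for the first put, the one that pays for a ZK write. */
    boolean onPut(String region) {
        if (!locallyNonEmpty.add(region)) {
            return false; // later puts: no ZK traffic
        }
        znodes.put(region, State.NON_EMPTY); // a synchronous write in the proposal
        return true;
    }

    /** Simulates ZK losing its content. */
    void loseZkData() {
        znodes.clear();
    }

    /** Master side, on crash: a missing znode means we split as today. */
    boolean canAssignImmediately(String region) {
        return znodes.get(region) == State.EMPTY;
    }
}
```

Only the first `onPut` after a flush reaches the stand-in store, and a region with no entry is never assigned early, which matches the "no dependency on ZK content" property.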

So we would have a maximum of 100K znodes (1 per region) in ZK, with one viewer 
(the master), and one writer (the region server).
These objects would be written on memstore creation and flush, so not very often.
If we don't have the znode in ZK, we split as today. We could lose the whole 
ZK data without any impact.
This can be made optional, and maybe even activated per table: it could be 
activated only for reference tables and meta. Heavily written tables would not 
do that. This lowers the number of znodes to write into ZK.
Region servers are already connected to ZooKeeper, so we don't add any ZK 
connection.

Pros:
- does the job: regions that were not written to will be reassigned immediately
- adds a safety net if we can't split the logs: the tables that were not written 
can be made available immediately
- optional, and configurable per table
- should not decrease write performance; only the first put is impacted (by 
about 10-15ms). With a block size of 128MB or more, it's acceptable imho.
- doesn't add workload (read or write) on HDFS
- no dependency on ZK content: we continue to work if the ZK content 
'disappears'.

Cons:
- adds workload on ZooKeeper: but it's configurable per table, so we can limit 
it to whatever we want. We can even imagine heuristics (wait before creating 
the znode, don't create it if a put occurs within 10 seconds, for example)
- as always, any new feature adds complexity to the whole thing... Could nearly 
be done with coprocessors (likely not the master part however).


 Immediate assignment of regions that don't have entries in HLog
 ---

 Key: HBASE-6774
 URL: https://issues.apache.org/jira/browse/HBASE-6774
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal


[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2012-11-21 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13502181#comment-13502181
 ] 

Jimmy Xiang commented on HBASE-6774:


Can we just split the logs and assign regions in parallel?  When opening a 
region, we check if the region is involved in log splitting somehow.  If not, 
open it.  Otherwise, hold there till the log splitting is done for that region.
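
A rough Java sketch of that gate, assuming the set of regions involved in log splitting is already known (how to compute it is the open question here). `SplitGate` and its methods are hypothetical names, not existing HBase code.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical gate: a region open proceeds at once unless the region has edits being split. */
class SplitGate {
    private final Map<String, CountDownLatch> pending = new ConcurrentHashMap<>();

    /** regionsWithEdits is assumed to be known up front. */
    SplitGate(Set<String> regionsWithEdits) {
        for (String region : regionsWithEdits) {
            pending.put(region, new CountDownLatch(1));
        }
    }

    /** Called when log splitting is done for one region. */
    void splitDone(String region) {
        CountDownLatch latch = pending.get(region);
        if (latch != null) {
            latch.countDown();
        }
    }

    /** Returns true if the region may open now, false if the wait timed out or was interrupted. */
    boolean awaitOpen(String region, long timeoutMillis) {
        CountDownLatch latch = pending.get(region);
        if (latch == null) {
            return true; // not involved in log splitting: open it
        }
        try {
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Regions outside the set open immediately; the others hold until `splitDone` is called for them.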

 Immediate assignment of regions that don't have entries in HLog
 ---

 Key: HBASE-6774
 URL: https://issues.apache.org/jira/browse/HBASE-6774
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal




[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2012-11-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13502309#comment-13502309
 ] 

stack commented on HBASE-6774:
--

I like Jimmy's suggestion as a first step at lowering MTTR hereabouts.

What if we wrote on the end of a WAL a list of all regions mentioned?

On crash, we'd look at the tail of all WALs and fully scan all WALs that were 
not properly closed to get the list of regions with edits.  Could be done in 
master.  Before a region opens, it could query master to see if it needs to 
pick up edits from a split?  (Maybe only do this if the assign is because of a 
regionserver crash -- add a marker to the assign message).

Writing stuff to zk could work but would be better if we could avoid having to 
do this?
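
The WAL-trailer idea above could look roughly like the following Java sketch. Everything here is hypothetical: `WalFile` is a toy model (a list of region names, one per edit, plus an optional trailer), not the real WAL format or reader API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a WAL: the region of each edit, plus a trailer if it was closed properly. */
class WalFile {
    final List<String> editRegions; // region of each edit, in log order
    final Set<String> trailer;      // regions mentioned; null if the WAL was not properly closed

    private WalFile(List<String> editRegions, Set<String> trailer) {
        this.editRegions = editRegions;
        this.trailer = trailer;
    }

    /** A properly closed WAL carries the list of regions mentioned at its end. */
    static WalFile closed(List<String> editRegions) {
        return new WalFile(editRegions, new HashSet<>(editRegions));
    }

    /** A WAL left open by a crash has no trailer. */
    static WalFile unclosed(List<String> editRegions) {
        return new WalFile(editRegions, null);
    }
}

class RegionsWithEdits {
    /** Reads only the trailer where there is one, and fully scans the WALs that lack it. */
    static Set<String> collect(List<WalFile> wals) {
        Set<String> regions = new HashSet<>();
        for (WalFile wal : wals) {
            if (wal.trailer != null) {
                regions.addAll(wal.trailer);
            } else {
                regions.addAll(wal.editRegions); // full scan of an unclosed WAL
            }
        }
        return regions;
    }
}
```

A region that is absent from the returned set has no edits to pick up and could be opened right away.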

 Immediate assignment of regions that don't have entries in HLog
 ---

 Key: HBASE-6774
 URL: https://issues.apache.org/jira/browse/HBASE-6774
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal




[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2012-11-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13502375#comment-13502375
 ] 

nkeywal commented on HBASE-6774:


Yes, everything is in the "somehow" of ??we check if the region is involved in 
log splitting somehow?? :-)

The advantage of doing that in ZK is that we don't have to open all the WALs, 
with the risk of going to a dead datanode (a bad datanode often means a 60 
second delay). And we don't have to fully read the last one (likewise, if we 
finally implement the multi WALs, we will have all these WALs to fully read). 
For the others, technically, reading backward may be difficult to optimize. 
Also, if the WAL is corrupted, we save the regions that were not written to.


This said, I agree that writing to ZK is not an easy decision. I think in the 
long term, having a widely shared real time status on the region is 
interesting, but we need the middleware (ZK here) to support this (lots of 
znodes with lots of readers). It's my famous ZOOKEEPER-1147.

Devaraj told me an idea from Enis for the .meta. case: just do a specific WAL 
for it. It can be generalized with the multiwals as well.

All these solutions are compatible with each other anyway...


 Immediate assignment of regions that don't have entries in HLog
 ---

 Key: HBASE-6774
 URL: https://issues.apache.org/jira/browse/HBASE-6774
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal

 The algo is today, after a failure detection:
 - split the logs
 - when all the logs are split, assign the regions
 But some regions can have no entries at all in the HLog. There are many 
 reasons for this:
 - kind of reference or historical tables. Bulk written sometimes then read 
 only.
 - sequential rowkeys. In this case, most of the regions will be read only. 
 But they can be in a regionserver with a lot of writes.
 - tables flushed often for safety reasons. I'm thinking about meta here.
 For meta; we can imagine flushing very often. Hence, the recovery for meta, 
 in many cases, will be the failure detection time.
 There are different possible algos:
 Option 1)
  A new task is added, in parallel with the split. This task reads all the HLog. 
 If there is no entry for a region, this region is assigned.
  Pro: simple
  Cons: We will need to read all the files. Adds a read.
 Option 2)
  The master writes in ZK the number of log files, per region.
  When the regionserver starts the split, it reads the full block (64M) and 
 decreases the log file counter of the region. If it reaches 0, the assign 
 starts. At the end of its split, the region server decreases the counter as 
 well. This allows the assign to start even if not all the HLogs are finished. 
 It would make some regions available even if we have an issue in one 
 of the log files.
  Pro: parallel
  Cons: adds something to do for the region server. Requires reading the whole 
 file before starting to write. 
 Option 3)
  Add some metadata at the end of the log file. The last log file won't have 
 metadata since, if we are recovering, it's because the server crashed. But the 
 others will. And the last log file should be smaller (half a block on average).  
 Option 4) Still some metadata, but in a different file. Cons: writes are 
 increased (but not that much, we just need to write the region once). Pros: 
 if we lose the HLog files (major failure, no replica available) we can still 
 continue with the regions that were not written at this stage.
 I think it should be done, even if none of the algorithms above is totally 
 convincing yet. It's linked as well to locality and short circuit reads: with 
 these two points, reading the file twice becomes much less of an issue, for 
 example. My current preference would be to open the file twice in the region 
 server, once for splitting as of today, once for a quick read looking for 
 unused regions. Who knows, maybe it would even be faster this way: the quick 
 read thread would warm up the different caches for the splitting thread.
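
To make the counter bookkeeping of Option 2 concrete, here is a minimal sketch. It assumes an in-memory map standing in for the ZK znodes, and it assumes that a region is released by a log file either right after the first read (no entry for it in that file) or at the end of that file's split (it had entries). The class and method names are hypothetical and are not HBase APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the per-region counters that
// Option 2 would store in ZK; none of these names come from HBase.
final class RegionLogCounters {
    // Number of log files that still block each region.
    private final Map<String, Integer> remaining = new HashMap<>();
    private final Set<String> assignable = new TreeSet<>();

    // Master side: each region starts with one count per log file.
    RegionLogCounters(Set<String> regions, int logFileCount) {
        for (String region : regions) {
            if (logFileCount <= 0) {
                assignable.add(region);
            } else {
                remaining.put(region, logFileCount);
            }
        }
    }

    // Splitter side, after the first read of one log file: regions
    // with no entry in that file are no longer blocked by it.
    synchronized void scanned(Set<String> regionsWithEdits) {
        // Iterate over a copy because release() mutates the map.
        for (String region : new TreeSet<>(remaining.keySet())) {
            if (!regionsWithEdits.contains(region)) {
                release(region);
            }
        }
    }

    // Splitter side, once the split of that log file is written out.
    synchronized void splitFinished(Set<String> regionsWithEdits) {
        for (String region : regionsWithEdits) {
            release(region);
        }
    }

    synchronized Set<String> assignableRegions() {
        return new TreeSet<>(assignable);
    }

    private void release(String region) {
        Integer left = remaining.get(region);
        if (left == null) {
            return; // unknown region, or already assignable
        }
        if (left == 1) {
            remaining.remove(region);
            assignable.add(region); // counter reached 0: assign can start
        } else {
            remaining.put(region, left - 1);
        }
    }
}
```

With two log files that only contain edits for one region, every other region becomes assignable as soon as both files have been scanned, before any split has finished.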

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2012-11-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13502413#comment-13502413
 ] 

stack commented on HBASE-6774:
--

If multiwals, yeah, should dedicate one for .META.

Agree all suggestions are not incompatible: i.e. we should do Jimmy's 
suggestion (You may have suggested similar a while back IIRC).

I like the issues you raise with the solution I suggest.  While we could read the 
last WAL while splitting, just reading metadata off the end of all WALs 
concerned would take, say, about a second each unless we did it in parallel, 
and with tens of WALs that's tens of seconds before we could open a region even 
when all is functioning without hiccups (add hiccups or a bad DN and it goes up 
significantly).

I do like not having to have another subsystem in the mix doing log splitting. 
I suppose we already have an optional dependency on zk farming out the work.

Could the regionserver send the master the regions mentioned in a WAL and let 
it do accounting or it could send sequenceids by flush to the master and let it 
figure out up to what entry it can skip edits?  It could do the writing to zk 
instead of every regionserver doing it for every WAL roll.  We are already 
sending over seqids on heartbeat so we can skip stale edits on crash.  Could we 
expand this functionality to add a by-region dimension?  (Haven't thought it 
through... just making a suggestion)
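
A rough sketch of the by-region accounting floated above, assuming the heartbeat carries a map from region name to last flushed sequence id. The class and method names are hypothetical, not existing HBase code; the master keeps the highest id seen per region and treats any edit at or below it as stale.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical master-side bookkeeping; not an existing HBase class.
final class FlushedSeqIdTracker {
    // Highest flushed sequence id reported so far, per region.
    private final Map<String, Long> flushedSeqIdByRegion = new HashMap<>();

    // Called for each heartbeat; keeps the maximum so that a late or
    // reordered report can never move a region's id backwards.
    synchronized void onHeartbeat(Map<String, Long> reported) {
        for (Map.Entry<String, Long> e : reported.entrySet()) {
            flushedSeqIdByRegion.merge(e.getKey(), e.getValue(), Long::max);
        }
    }

    // An edit is stale if the region already flushed past it. Regions
    // that never reported are treated conservatively: nothing is skipped.
    synchronized boolean canSkip(String region, long editSeqId) {
        Long flushed = flushedSeqIdByRegion.get(region);
        return flushed != null && editSeqId <= flushed;
    }
}
```

During log splitting, the splitter would ask `canSkip(region, seqId)` for each WAL entry and drop the entry when it returns true.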





[jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog

2012-11-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13502451#comment-13502451
 ] 

nkeywal commented on HBASE-6774:


bq. If multiwals, yeah, should dedicate one for .META.
Is someone working on a multiwals implementation, or is it still being 
studied?

bq. we should do Jimmy's suggestion. (You may have suggested similar a while 
back IIRC)
Yes, it's option 3) in this jira description :-). I was not totally satisfied, 
that's why I tried to find something different. I agree it's more a different 
balance than a better solution.

bq. Could the regionserver send the master the regions mentioned in a WAL and 
let it do accounting or it could send sequenceids by flush to the master
I haven't thought about the master option; it could be a solution. I need to 
think about it.
