[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data
[ https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451437#comment-13451437 ]

Hudson commented on HBASE-6590:
-------------------------------

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #166 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/166/])
    HBASE-6630 Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files (Amitanand) (Revision 1382351)

     Result = FAILURE
tedyu :
Files :
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
* /hbase/trunk/hbase-server/src/main/protobuf/Client.proto
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionServerBulkLoad.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALReplay.java

> [0.89-fb] Assign sequence number to bulk loaded data
> ----------------------------------------------------
>
>                 Key: HBASE-6590
>                 URL: https://issues.apache.org/jira/browse/HBASE-6590
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Amitanand Aiyer
>            Assignee: Amitanand Aiyer
>            Priority: Minor
>             Fix For: 0.89-fb
>
>
> Currently bulk loaded files are not assigned a sequence number. Thus, they
> can only be used to import historical data, dating to the past. There are
> cases where we want to bulk load "current data"; but the bulk load mechanism
> does not support this, as the bulk loaded files are always sorted behind the
> non-bulkloaded hfiles. Assigning Sequence Id to bulk loaded files should
> solve this issue.
>
> StoreFiles within a store are sorted based on the sequenceId. SequenceId is a
> monotonically increasing number that accompanies every edit written to the
> WAL. For entries that update the same cell, we would like the later edit to
> win. This comparison is accomplished using memstoreTS, at the KV level; and
> sequenceId at the StoreFile level (to order scanners in the KeyValueHeap).
>
> BulkLoaded files are generated outside of HBase/RegionServer, so they do not
> have a sequenceId written in the file. This causes HBase to lose track of
> the point in time when the BulkLoaded file was imported to HBase, resulting
> in a behavior that **only** supports viewing bulkLoaded files as files
> back-filling data from the beginning of time.
>
> By assigning a sequence number to the file, we can allow the bulk loaded file
> to fit in where we want: either at the "current time" or the "beginning of
> time". The latter is the default, to maintain backward compatibility.
>
> Design approach:
> Store files keep track of the sequence Id in the trailer. Since we do not
> wish to edit/rewrite the bulk loaded file upon import, we will encode the
> assigned sequenceId into the fileName. The filename RegEx is updated in this
> regard. If the sequenceId is encoded in the filename, the sequenceId will be
> used as the sequenceId for the file. If none is found, the sequenceId will be
> considered 0 (as per the default, backward-compatible behavior).
>
> To enable clients to request pre-existing behavior, the command line
> utility allows for 2 ways to import BulkLoaded Files: to assign or not assign
> a sequence Number.
>    - If a sequence Number is assigned, the imported file will be imported
>      with the "current sequence Id".
>    - If the sequence Number is not assigned, it will be as if it was
>      backfilling old data, from the beginning of time.
>
> Compaction behavior:
>    - With the current compaction algorithm, bulk loaded files -- that backfill
>      data, to the beginning of time -- can cause a compaction storm, converting
>      every minor compaction

[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data
[ https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451418#comment-13451418 ]

Hudson commented on HBASE-6590:
-------------------------------

Integrated in HBase-TRUNK #3316 (See [https://builds.apache.org/job/HBase-TRUNK/3316/])
    HBASE-6630 Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files (Amitanand) (Revision 1382351)

     Result = SUCCESS
tedyu :
Files :
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
* /hbase/trunk/hbase-server/src/main/protobuf/Client.proto
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionServerBulkLoad.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALReplay.java

[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data
[ https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439095#comment-13439095 ]

Amitanand Aiyer commented on HBASE-6590:
----------------------------------------

@stack: Yes. The regionserver gets the sequenceId from the HLog when doing the bulkLoad operation.

On success, the file is renamed to a random name. If we are assigning sequenceIds, this random name is appended with a string of the form _SeqId__ that can be parsed by StoreFile to get the sequence number.

> [0.89-fb] Assign sequence number to bulk loaded data
> ----------------------------------------------------
>
>                 Key: HBASE-6590
>                 URL: https://issues.apache.org/jira/browse/HBASE-6590
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Amitanand Aiyer
>            Assignee: Amitanand Aiyer
>            Priority: Minor
>             Fix For: 0.89-fb
>
>
> Currently bulk loaded files are not assigned a sequence number. Thus, they
> can only be used to import historical data, dating to the past. There are
> cases where we want to bulk load "current data"; but the bulk load mechanism
> does not support this, as the bulk loaded files are always sorted behind the
> non-bulkloaded hfiles. Assigning Sequence Id to bulk loaded files should
> solve this issue.
>
> StoreFiles within a store are sorted based on the sequenceId. SequenceId is a
> monotonically increasing number that accompanies every edit written to the
> WAL. For entries that update the same cell, we would like the later edit to
> win. This comparison is accomplished using memstoreTS, at the KV level; and
> sequenceId at the StoreFile level (to order scanners in the KeyValueHeap).
>
> BulkLoaded files are generated outside of HBase/RegionServer, so they do not
> have a sequenceId written in the file. This causes HBase to lose track of
> the point in time when the BulkLoaded file was imported to HBase, resulting
> in a behavior that **only** supports viewing bulkLoaded files as files
> back-filling data from the beginning of time.
>
> By assigning a sequence number to the file, we can allow the bulk loaded file
> to fit in where we want: either at the "current time" or the "beginning of
> time". The latter is the default, to maintain backward compatibility.
>
> Design approach:
> Store files keep track of the sequence Id in the trailer. Since we do not
> wish to edit/rewrite the bulk loaded file upon import, we will encode the
> assigned sequenceId into the fileName. The filename RegEx is updated in this
> regard. If the sequenceId is encoded in the filename, the sequenceId will be
> used as the sequenceId for the file. If none is found, the sequenceId will be
> considered 0 (as per the default, backward-compatible behavior).
>
> To enable clients to request pre-existing behavior, the command line
> utility allows for 2 ways to import BulkLoaded Files: to assign or not assign
> a sequence Number.
>    - If a sequence Number is assigned, the imported file will be imported
>      with the "current sequence Id".
>    - If the sequence Number is not assigned, it will be as if it was
>      backfilling old data, from the beginning of time.
>
> Compaction behavior:
>    - With the current compaction algorithm, bulk loaded files -- that backfill
>      data, to the beginning of time -- can cause a compaction storm, converting
>      every minor compaction to a major compaction. To address this, these
>      files are excluded from minor compaction, based on a config param
>      (enabled for the messages use case).
>    - Since bulk loaded files that are not back-filling data do not cause
>      this issue, they will not be ignored during minor compactions based on
>      the config parameter. This is also required to ensure that there are no
>      holes in the set of files selected for compaction -- this is necessary
>      to preserve the order of KV comparison before and after compaction.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

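The ordering and compaction rules in the issue description can be illustrated with a small Java sketch. It is not taken from the HBase patch: the class `BulkLoadCompactionSketch`, the `FileInfo` type, the `excludeBackfill` flag, and the file names are all hypothetical, and a real store file selection involves far more than this. The sketch only shows that files are ordered by sequence id, that a bulk loaded file with sequence id 0 lands at the oldest end, and that only such back-filling files are skipped when the (assumed) flag is set.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only; none of these names come from the HBase code base. */
final class BulkLoadCompactionSketch {

    /** Hypothetical stand-in for a store file. */
    static final class FileInfo {
        final String name;
        final long sequenceId;     // 0 means "no sequence id assigned"
        final boolean bulkLoaded;

        FileInfo(String name, long sequenceId, boolean bulkLoaded) {
            this.name = name;
            this.sequenceId = sequenceId;
            this.bulkLoaded = bulkLoaded;
        }
    }

    private BulkLoadCompactionSketch() {}

    /** Oldest first: ascending sequence id, so id 0 is treated as the oldest data. */
    static List<FileInfo> oldestFirst(List<FileInfo> files) {
        List<FileInfo> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((FileInfo f) -> f.sequenceId));
        return sorted;
    }

    /**
     * Names of the candidates for a minor compaction, oldest first.
     * When excludeBackfill is true, bulk loaded files without a sequence id
     * are skipped; bulk loaded files that carry a sequence id stay in.
     */
    static List<String> minorCompactionCandidates(List<FileInfo> files,
                                                  boolean excludeBackfill) {
        List<String> names = new ArrayList<>();
        for (FileInfo f : oldestFirst(files)) {
            boolean backfill = f.bulkLoaded && f.sequenceId == 0L;
            if (excludeBackfill && backfill) {
                continue;
            }
            names.add(f.name);
        }
        return names;
    }
}
```

In this sketch the skipped file always sits at the old end of the ordering, so the remaining candidates form a contiguous run. Skipping a bulk loaded file that carries a current sequence id would instead leave a hole between its neighbours, which is the situation the description says must be avoided.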
[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data
[ https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436972#comment-13436972 ]

stack commented on HBASE-6590:
------------------------------

How would we get the current sequence number on bulk load? We'd ask the regionserver we were bulk loading into? Would there be a rename of the hfile on successful bulk load to add the sequenceid to the filename?

I like the notion of adding the sequenceid to the filename. It'd be after the current filename, which is ts IIRC? Or is it a random number?

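The file-naming scheme discussed in this thread (a random file name followed by a sequence-id suffix, with 0 as the fallback) could be sketched as below. This is a minimal illustration, not the StoreFile code: the class and method names are hypothetical, and the suffix shape `_SeqId_<digits>_` is an assumption, since the thread only shows the suffix as `_SeqId__`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not the parsing code used by StoreFile. */
final class BulkLoadFileName {

    // Assumed suffix shape: _SeqId_<digits>_ at the very end of the file name.
    private static final Pattern SEQ_ID_SUFFIX = Pattern.compile("_SeqId_(\\d+)_$");

    private BulkLoadFileName() {}

    /** Returns the sequence id encoded in the name, or 0 when none is found. */
    static long sequenceIdOf(String fileName) {
        Matcher m = SEQ_ID_SUFFIX.matcher(fileName);
        return m.find() ? Long.parseLong(m.group(1)) : 0L;
    }

    /** Appends the assumed suffix to an already chosen (random) file name. */
    static String withSequenceId(String fileName, long sequenceId) {
        return fileName + "_SeqId_" + sequenceId + "_";
    }
}
```

Because the id lives in the name, the bulk loaded file itself never has to be rewritten; a name without the suffix simply falls back to 0, which matches the backward-compatible default described in the issue.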
[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data
[ https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436952#comment-13436952 ]

Lars Hofhansl commented on HBASE-6590:
--------------------------------------

@Amit: If you have time to make a trunk patch that'd be cool.

[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data
[ https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436949#comment-13436949 ]

Zhihong Ted Yu commented on HBASE-6590:
---------------------------------------

@Amit:
Please create another JIRA and rebase your patch for HBase trunk.

Thanks a lot.

[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data
[ https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436944#comment-13436944 ]

Amitanand Aiyer commented on HBASE-6590:
----------------------------------------

Yes, thanks. I did mean compaction. Have updated the summary/subject accordingly.

[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data
[ https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436942#comment-13436942 ]

Amitanand Aiyer commented on HBASE-6590:
----------------------------------------

Thanks Ted/Lars.

Here is the diff that was reviewed for 0.89-fb:
https://reviews.facebook.net/D3789

Will be happy to port it to trunk, if you think it'll be useful.