[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data

2012-09-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451437#comment-13451437
 ] 

Hudson commented on HBASE-6590:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #166 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/166/])
HBASE-6630 Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk 
loaded files (Amitanand) (Revision 1382351)

 Result = FAILURE
tedyu : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
* /hbase/trunk/hbase-server/src/main/protobuf/Client.proto
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionServerBulkLoad.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALReplay.java



[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data

2012-09-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451418#comment-13451418
 ] 

Hudson commented on HBASE-6590:
---

Integrated in HBase-TRUNK #3316 (See 
[https://builds.apache.org/job/HBase-TRUNK/3316/])
HBASE-6630 Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk 
loaded files (Amitanand) (Revision 1382351)

 Result = SUCCESS
tedyu : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
* /hbase/trunk/hbase-server/src/main/protobuf/Client.proto
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionServerBulkLoad.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALReplay.java



[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data

2012-08-21 Thread Amitanand Aiyer (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439095#comment-13439095
 ] 

Amitanand Aiyer commented on HBASE-6590:


@stack: Yes. The regionserver gets the sequenceId from the HLog when doing the 
bulkLoad operation. 
On success, the file is renamed to a random name. If we are assigning 
sequenceIds, a string of the form _SeqId_<id>_ is appended to this random 
name, which StoreFile can parse to get the sequence number.
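
The naming scheme described above can be sketched roughly as follows. This is 
an illustration only, not the StoreFile code from the patch: the class name, 
the method names, and the exact suffix layout (the literal _SeqId_, decimal 
digits, then a trailing underscore at the end of the name) are assumptions 
made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the HBase StoreFile implementation.
final class BulkLoadSeqIdSketch {
    // Assumed suffix layout: "_SeqId_" + decimal digits + "_" at the end of the name.
    private static final Pattern SEQ_ID_SUFFIX = Pattern.compile("_SeqId_(\\d+)_$");

    /** Appends the assigned sequence id to the randomly generated file name. */
    static String withSeqId(String randomName, long seqId) {
        return randomName + "_SeqId_" + seqId + "_";
    }

    /** Returns the encoded sequence id, or 0 when the name carries none (backfill default). */
    static long parseSeqId(String fileName) {
        Matcher m = SEQ_ID_SUFFIX.matcher(fileName);
        return m.find() ? Long.parseLong(m.group(1)) : 0L;
    }

    public static void main(String[] args) {
        String renamed = withSeqId("3f2a9c", 1234L);
        System.out.println(renamed + " -> " + parseSeqId(renamed));
        System.out.println("3f2a9c -> " + parseSeqId("3f2a9c"));
    }
}
```

Falling back to 0 when no suffix is present mirrors the backward-compatible 
default in the issue description: a file without an encoded id is treated as 
backfilling data from the beginning of time.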

> [0.89-fb] Assign sequence number to bulk loaded data
> ----------------------------------------------------
>
> Key: HBASE-6590
> URL: https://issues.apache.org/jira/browse/HBASE-6590
> Project: HBase
> Issue Type: Bug
> Reporter: Amitanand Aiyer
> Assignee: Amitanand Aiyer
> Priority: Minor
> Fix For: 0.89-fb
>
>
> Currently, bulk loaded files are not assigned a sequence number, so they can
> only be used to import historical data. There are cases where we want to
> bulk load "current data", but the bulk load mechanism does not support this
> because bulk loaded files are always sorted behind the non-bulk-loaded
> HFiles. Assigning a sequenceId to bulk loaded files should solve this issue.
>
> StoreFiles within a store are sorted by sequenceId. The sequenceId is a
> monotonically increasing number that accompanies every edit written to the
> WAL. For entries that update the same cell, we would like the later edit to
> win. This comparison is done using memstoreTS at the KV level, and
> sequenceId at the StoreFile level (to order scanners in the KeyValueHeap).
>
> Bulk loaded files are generated outside of HBase/RegionServer, so they do
> not have a sequenceId written in the file. This causes HBase to lose track
> of the point in time at which the bulk loaded file was imported, resulting
> in a behavior that **only** supports viewing bulk loaded files as files
> back-filling data from the beginning of time.
>
> By assigning a sequence number to the file, we can allow the bulk loaded
> file to fit in where we want: either at the "current time" or at the
> "beginning of time". The latter is the default, to maintain backward
> compatibility.
>
> Design approach:
>   Store files keep track of the sequenceId in the trailer. Since we do not
> wish to edit/rewrite the bulk loaded file upon import, we will encode the
> assigned sequenceId into the file name. The filename regex is updated
> accordingly. If a sequenceId is encoded in the filename, it is used as the
> sequenceId for the file. If none is found, the sequenceId is considered 0
> (the default, backward-compatible behavior).
>   To let clients request the pre-existing behavior, the command line
> utility allows two ways to import bulk loaded files: with or without
> assigning a sequence number.
>   - If a sequence number is assigned, the file is imported with the
>     "current sequence Id".
>   - If no sequence number is assigned, the file is treated as backfilling
>     old data, from the beginning of time.
>
> Compaction behavior:
>   - With the current compaction algorithm, bulk loaded files that backfill
>     data to the beginning of time can cause a compaction storm, converting
>     every minor compaction to a major compaction. To address this, these
>     files are excluded from minor compaction, based on a config param
>     (enabled for the messages use case).
>   - Since bulk loaded files that are not back-filling data do not cause
>     this issue, they will not be ignored during minor compactions based on
>     the config parameter. This is also required to ensure that there are no
>     holes in the set of files selected for compaction, which is necessary
>     to preserve the order of KV comparison before and after compaction.
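
The ordering and compaction rules in the description can be illustrated with 
a small sketch. It is a simplified model under stated assumptions, not the 
HBase compaction code: FileInfo, the method names, and the skipBackfill flag 
(standing in for the config param mentioned above) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model for illustration; not the HBase Store/compaction code.
final class CompactionSelectionSketch {
    /** Minimal stand-in for a store file: name, sequence id, bulk-load flag. */
    static final class FileInfo {
        final String name;
        final long seqId;        // 0 means "no sequence id assigned"
        final boolean bulkLoaded;

        FileInfo(String name, long seqId, boolean bulkLoaded) {
            this.name = name;
            this.seqId = seqId;
            this.bulkLoaded = bulkLoaded;
        }
    }

    /** Orders files by ascending sequence id, so seqId 0 (backfill) files come first. */
    static List<FileInfo> sortBySeqId(List<FileInfo> files) {
        List<FileInfo> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((FileInfo f) -> f.seqId));
        return sorted;
    }

    /**
     * Picks minor-compaction candidates. When skipBackfill is set, bulk loaded
     * files without a sequence id are left out; bulk loaded files that carry a
     * sequence id are kept like any other file.
     */
    static List<FileInfo> minorCompactionCandidates(List<FileInfo> files, boolean skipBackfill) {
        List<FileInfo> candidates = new ArrayList<>();
        for (FileInfo f : sortBySeqId(files)) {
            boolean backfill = f.bulkLoaded && f.seqId == 0L;
            if (skipBackfill && backfill) {
                continue;
            }
            candidates.add(f);
        }
        return candidates;
    }

    /** Convenience accessor used to inspect a selection. */
    static List<String> names(List<FileInfo> files) {
        List<String> out = new ArrayList<>();
        for (FileInfo f : files) {
            out.add(f.name);
        }
        return out;
    }
}
```

In this model the skipped backfill files all sit at the front of the sorted 
list, so dropping them removes a prefix and the remaining candidates stay 
contiguous in sequence-id order. A bulk loaded file with an assigned id sits 
between flushed files, which is why skipping it would leave a hole.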

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data

2012-08-17 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436972#comment-13436972
 ] 

stack commented on HBASE-6590:
--

How would we get the current sequence number on bulk load?  We'd ask the 
regionserver we were bulk loading into?  Would there be a rename of the hfile 
on successful bulk load to add the sequenceid to the filename?

I like the notion of adding the sequenceid to the filename.  It'd go after the 
current filename, which is a ts IIRC?  Or is it a random number?





[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data

2012-08-17 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436952#comment-13436952
 ] 

Lars Hofhansl commented on HBASE-6590:
--

@Amit: If you have time to make a trunk patch that'd be cool.





[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data

2012-08-17 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436949#comment-13436949
 ] 

Zhihong Ted Yu commented on HBASE-6590:
---

@Amit:
Please create another JIRA and rebase your patch for HBase trunk.

Thanks a lot.





[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data

2012-08-17 Thread Amitanand Aiyer (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436944#comment-13436944
 ] 

Amitanand Aiyer commented on HBASE-6590:


Yes, thanks. I did mean compaction. Have updated the summary/subject 
accordingly.





[jira] [Commented] (HBASE-6590) [0.89-fb] Assign sequence number to bulk loaded data

2012-08-17 Thread Amitanand Aiyer (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436942#comment-13436942
 ] 

Amitanand Aiyer commented on HBASE-6590:


Thanks Ted/Lars. 

Here is the diff that was reviewed for 0.89-fb:
https://reviews.facebook.net/D3789 

Will be happy to port it to trunk, if you think it'll be useful.
