[jira] [Commented] (HBASE-15469) Take snapshot by family

2016-03-21 Thread Jianwei Cui (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203878#comment-15203878
 ] 

Jianwei Cui commented on HBASE-15469:
-

In our case, the goal is to copy the existing data for the given families and 
clone the snapshot, so creating a new table with only that subset of families 
is the better choice. For the restore case, the goal is presumably to roll the 
table back to some historical state. A snapshot with only a subset of families 
may not represent any historical state of the table, so it should not be used 
for restore.
{quote}
We may block the restore of snapshots with only a subset of families, and that 
will solve the strange situation of restore. 
And when we clone, we just create a new table with only the subset. In theory 
this is clearer for the end user. 
{quote}
I agree with your analysis, [~mbertozzi], and would also welcome other opinions 
and use cases. Thanks!
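
The policy discussed in this comment (clone creates a table with only the 
snapshotted families, restore is refused for a partial snapshot) could be 
sketched roughly as below. This is plain Python for illustration, not HBase 
code: {{Snapshot}}, {{clone_families}} and {{check_restore_allowed}} are 
hypothetical names, and the actual patch may be structured quite differently.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a snapshot description (not the HBase class)."""
    table: str
    table_families: FrozenSet[str]             # families of the source table
    families: Optional[FrozenSet[str]] = None  # None means "whole table"

    @property
    def is_partial(self) -> bool:
        # Partial when an explicit family set was given that differs from the table's.
        return self.families is not None and self.families != self.table_families


def clone_families(snapshot: Snapshot) -> FrozenSet[str]:
    """Families the cloned table is created with: only those in the snapshot."""
    if snapshot.families is not None:
        return snapshot.families
    return snapshot.table_families


def check_restore_allowed(snapshot: Snapshot) -> None:
    """Refuse to restore a snapshot that covers only a subset of families."""
    if snapshot.is_partial:
        raise ValueError(
            f"snapshot of {snapshot.table} covers only {sorted(snapshot.families)}; "
            "restore is blocked, clone it instead"
        )
```

With this shape, a whole-table snapshot behaves as before, while a by-family 
snapshot can only be cloned.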

> Take snapshot by family
> ---
>
> Key: HBASE-15469
> URL: https://issues.apache.org/jira/browse/HBASE-15469
> Project: HBase
> Issue Type: Improvement
> Components: snapshots
> Affects Versions: 2.0.0
> Reporter: Jianwei Cui
> Attachments: HBASE-15469-v1.patch, HBASE-15469-v2.patch
>
>
> In our production environment, there are some 'wide' tables in the offline 
> cluster. A 'wide' table has many families, and different applications 
> access different families of the table through MapReduce. When an 
> application starts to provide online service, we need to copy the families 
> it needs from the offline cluster to the online cluster. For future writes, 
> inter-cluster replication supports setting families for a table, so we can 
> use it to copy future edits for the needed families. For existing data, we 
> can take a snapshot of the table on the offline cluster, then use 
> {{ExportSnapshot}} to copy the snapshot to the online cluster and clone it. 
> However, we can only take a snapshot of the whole table, in which many 
> families are not needed by the application; this leads to unnecessary data 
> copying. I think it would be useful to support taking a snapshot by family, 
> so that we copy only the needed data.
> A possible solution to support this:
> 1. Add a family names field to the protobuf definition of 
> {{SnapshotDescription}}.
> 2. Allow setting families when taking a snapshot in the hbase shell, such as:
> {code}
> snapshot 'tableName', 'snapshotName', 'FamilyA', 'FamilyB', {SKIP_FLUSH => true}
> {code}
> 3. Add the family names to {{SnapshotDescription}} on the client side.
> 4. Read the family names from {{SnapshotDescription}} in the 
> Master/RegionServer, and keep only the requested families when taking the 
> snapshot of a region.
> Discussions and suggestions are welcome.
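
The steps proposed in the issue description above could be sketched roughly 
as follows. This is a toy model in plain Python, not the patch itself: the 
dictionary standing in for {{SnapshotDescription}}, the function names, and 
the rule that an empty family list means "whole table" are all assumptions 
made for illustration.

```python
from typing import Dict, List, Sequence


def build_snapshot_description(table: str, name: str,
                               families: Sequence[str] = ()) -> Dict[str, object]:
    """Client side (steps 1 and 3): record the requested families, if any.

    The dict is a hypothetical stand-in for the SnapshotDescription message.
    """
    return {"table": table, "name": name, "families": list(families)}


def files_to_snapshot(region_files: Dict[str, List[str]],
                      description: Dict[str, object]) -> Dict[str, List[str]]:
    """Server side (step 4): keep only the store files of requested families.

    Assumption: an empty family list means the whole table is snapshotted.
    """
    requested = description["families"]
    if not requested:
        return dict(region_files)
    unknown = [f for f in requested if f not in region_files]
    if unknown:
        raise KeyError(f"unknown families: {unknown}")
    return {f: region_files[f] for f in requested}


# Usage: a region with three families, of which two are requested.
region = {"FamilyA": ["a1", "a2"], "FamilyB": ["b1"], "FamilyC": ["c1"]}
desc = build_snapshot_description("tableName", "snapshotName", ["FamilyA", "FamilyB"])
selected = files_to_snapshot(region, desc)
```

Only the store files under {{FamilyA}} and {{FamilyB}} end up in {{selected}}, 
which is the data that would then need to be exported.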



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15469) Take snapshot by family

2016-03-19 Thread Jianwei Cui (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200847#comment-15200847
 ] 

Jianwei Cui commented on HBASE-15469:
-

Good question! Yes, the current patch will create all families when cloning or 
restoring. This could be made optional for the user. In most cases, isn't it 
more reasonable to retain only the families requested when the snapshot was 
taken? Users can add other needed families after cloning or restoring. What do 
you think, [~mbertozzi]? Thanks.



[jira] [Commented] (HBASE-15469) Take snapshot by family

2016-03-19 Thread Jianwei Cui (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199512#comment-15199512
 ] 

Jianwei Cui commented on HBASE-15469:
-

Uploaded the patch. In the hbase shell, we can specify families when taking a 
snapshot:
{code}
hbase(main):004:0> snapshot 'test_table', 'test-snapshot', 'f1'
0 row(s) in 0.3830 seconds
{code}
And {{list_snapshots}} will show the table and families of the snapshot:
{code}
hbase(main):001:0> list_snapshots
SNAPSHOT            TABLE/CFs + CREATION TIME
 test-snapshot      test_table/f1 (Thu Mar 17 20:54:22 +0800 2016)
1 row(s) in 0.2890 seconds
{code}
This snapshot can then be used by other operations, such as 
{{clone_snapshot}} and {{restore_snapshot}}.
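
As a rough illustration of the shell behaviour shown above, the sketch below 
models how the extra arguments of {{snapshot}} might be split into family 
names and options, and how the {{TABLE/CFs}} cell might be rendered. It is 
plain Python rather than the shell's Ruby, the function names are 
hypothetical, and joining several families with a comma is an assumption 
(the output above only shows a single family).

```python
from typing import Dict, List, Tuple


def parse_snapshot_args(*args: object) -> Tuple[List[str], Dict[str, object]]:
    """Split the arguments after the snapshot name into families and options."""
    families: List[str] = []
    options: Dict[str, object] = {}
    for arg in args:
        if isinstance(arg, dict):
            options.update(arg)      # e.g. {"SKIP_FLUSH": True}
        elif isinstance(arg, str):
            families.append(arg)     # e.g. "f1"
        else:
            raise TypeError(f"unexpected argument: {arg!r}")
    return families, options


def table_cfs_cell(table: str, families: List[str], created: str) -> str:
    """Format the TABLE/CFs + CREATION TIME cell for one snapshot."""
    # Assumption: multiple families are comma-separated after the table name.
    label = f"{table}/{','.join(families)}" if families else table
    return f"{label} ({created})"


families, options = parse_snapshot_args("f1")
cell = table_cfs_cell("test_table", families, "Thu Mar 17 20:54:22 +0800 2016")
```

Here {{cell}} matches the row shown in the {{list_snapshots}} output above.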



[jira] [Commented] (HBASE-15469) Take snapshot by family

2016-03-19 Thread Jianwei Cui (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201278#comment-15201278
 ] 

Jianwei Cui commented on HBASE-15469:
-

Uploaded v2 to remove unrelated changes in hbase-site.xml, and created an RB 
(Review Board) request.



[jira] [Commented] (HBASE-15469) Take snapshot by family

2016-03-19 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202069#comment-15202069
 ] 

Matteo Bertozzi commented on HBASE-15469:
-

I have no strong opinion. I'd like to hear someone else's opinion from a 
user's point of view. 

Cloning a snapshot and getting just a subset of the columns populated will 
probably be weird. Maybe it is better to just create a table with only the 
subset of columns that are going to be populated. 

Restoring a snapshot is a bit more complex, because in theory you expect that 
only the families you snapshotted are restored and the other families are 
kept. But that may result in missing data if new rows were added to the table 
after you snapshotted it. 

So I have no idea what the best option is. I guess it depends on the use 
case. 
If you don't use restore/clone of a snapshot with a subset of families as a 
backup, but just use it for MR purposes, then you don't have these problems. 
We may block the restore of snapshots with only a subset of families, and that 
will solve the strange situation of restore. 
And when we clone, we just create a new table with only the subset. In theory 
this is clearer for the end user. 
But again, I'd like to hear other opinions and use cases.
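
To make the restore concern concrete, here is a toy model of it in plain 
Python. A table is represented as a nested dict (family, then row key, then 
value), and {{snapshot_families}} and {{restore_families}} are hypothetical 
helpers, not HBase APIs. The sketch assumes a restore that rolls back only 
the snapshotted families and leaves the others untouched.

```python
from copy import deepcopy
from typing import Dict, Iterable

Table = Dict[str, Dict[str, str]]  # family -> row key -> value


def snapshot_families(table: Table, families: Iterable[str]) -> Table:
    """Copy only the given families (a toy stand-in for a by-family snapshot)."""
    return {f: deepcopy(table[f]) for f in families}


def restore_families(table: Table, snapshot: Table) -> Table:
    """Roll back the snapshotted families; leave all other families untouched."""
    restored = deepcopy(table)
    for family, rows in snapshot.items():
        restored[family] = deepcopy(rows)
    return restored


table: Table = {"f1": {"row1": "a"}, "f2": {"row1": "b"}}
snap = snapshot_families(table, ["f1"])

# A new row is written to both families after the snapshot was taken.
table["f1"]["row2"] = "c"
table["f2"]["row2"] = "d"

restored = restore_families(table, snap)

# row2 keeps its f2 cell but loses its f1 cell, a state the table never had.
half_restored = sorted(set(restored["f2"]) - set(restored["f1"]))
print(half_restored)  # prints ['row2']
```

In this model the restored table matches neither the state at snapshot time 
nor the state just before the restore, which is the "strange situation" that 
blocking restore for partial snapshots would avoid.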



[jira] [Commented] (HBASE-15469) Take snapshot by family

2016-03-19 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199615#comment-15199615
 ] 

Matteo Bertozzi commented on HBASE-15469:
-

What happens if you try to clone or restore? Is the table created with all the 
families, with those not in the snapshot left empty?
