[jira] [Resolved] (HBASE-19754) Backport HBASE-11409 to branch-1 and branch-1.4

2018-01-22 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-19754.

Resolution: Duplicate

> Backport HBASE-11409 to branch-1 and branch-1.4
> ---
>
> Key: HBASE-19754
> URL: https://issues.apache.org/jira/browse/HBASE-19754
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.4.0, 1.5.0
>Reporter: churro morales
>Assignee: churro morales
>Priority: Minor
> Attachments: HBASE-19754.branch-1.patch
>
>
> backport HBASE-11409 to branch-1, branch-1.4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-19754) Backport HBASE-11409 to branch-1 and branch-1.4

2018-01-22 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales reopened HBASE-19754:


> Backport HBASE-11409 to branch-1 and branch-1.4
> ---
>
> Key: HBASE-19754
> URL: https://issues.apache.org/jira/browse/HBASE-19754
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.4.0, 1.5.0
>Reporter: churro morales
>Assignee: churro morales
>Priority: Minor
> Attachments: HBASE-19754.branch-1.patch
>
>
> backport HBASE-11409 to branch-1, branch-1.4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-13459) A more robust Verify Replication

2018-01-22 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-13459.

Resolution: Won't Fix

We have SyncTable, which is much better.

> A more robust Verify Replication 
> -
>
> Key: HBASE-13459
> URL: https://issues.apache.org/jira/browse/HBASE-13459
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 1.0.1, 0.98.12
>Reporter: churro morales
>Assignee: churro morales
>Priority: Minor
> Attachments: HBASE-13459-0.98.patch
>
>
> We have done quite a bit of data center migration work in the past year.  We 
> modified verify replication a bit to help us out.
> Things like:
> Ignoring timestamps when comparing Cells
> More detailed counters when discrepancies are reported between rows; added the 
> following counters: 
> SOURCEMISSINGROWS, TARGETMISSINGROWS, SOURCEMISSINGKEYS, TARGETMISSINGKEYS
> Also added the ability to run this job on any pair of tables and clusters.
> If folks are interested I can put up the patch and backport.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-13043) Backport HBASE-11436 to 94 branch

2018-01-22 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-13043.

Resolution: Won't Do

> Backport HBASE-11436 to 94 branch
> -
>
> Key: HBASE-13043
> URL: https://issues.apache.org/jira/browse/HBASE-13043
> Project: HBase
>  Issue Type: Task
>Reporter: churro morales
>Assignee: churro morales
>Priority: Major
> Attachments: HBASE-11436-0.94.patch
>
>
> it would be nice to be able to specify key ranges for the export job in 94  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-11409) Add more flexibility for input directory structure to LoadIncrementalHFiles

2018-01-11 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales reopened HBASE-11409:


> Add more flexibility for input directory structure to LoadIncrementalHFiles
> ---
>
> Key: HBASE-11409
> URL: https://issues.apache.org/jira/browse/HBASE-11409
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: churro morales
>Assignee: churro morales
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-11409.v1.patch, HBASE-11409.v2.patch, 
> HBASE-11409.v3.patch, HBASE-11409.v4.patch, HBASE-11409.v5.patch, 
> HBASE-11409.v6.branch-1.patch
>
>
> Use case:
> We were trying to combine two very large tables into a single table.  Thus we 
> ran jobs in one datacenter that populated certain column families and another 
> datacenter which populated other column families.  Took a snapshot and 
> exported them to their respective datacenters.  Wanted to simply take the 
> hdfs restored snapshot and use LoadIncremental to merge the data.  
> It would be nice to add support where we could run LoadIncremental on a 
> directory where the depth of store files is something other than two (current 
> behavior).  
> With snapshots it would be nice if you could pass a restored hdfs snapshot's 
> directory and have the tool run.  
> I am attaching a patch where I parameterize the bulkLoad timeout as well as 
> the default store file depth.  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (HBASE-19754) Backport HBASE-11409 to branch-1 and branch-1.4

2018-01-11 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-19754.

Resolution: Fixed

Moving this ticket back to the original HBASE-11409 for the branch-1 backport

> Backport HBASE-11409 to branch-1 and branch-1.4
> ---
>
> Key: HBASE-19754
> URL: https://issues.apache.org/jira/browse/HBASE-19754
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.4.0, 1.5.0
>Reporter: churro morales
>Assignee: churro morales
>Priority: Minor
> Attachments: HBASE-19754.branch-1.patch
>
>
> backport HBASE-11409 to branch-1, branch-1.4



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (HBASE-11409) Add more flexibility for input directory structure to LoadIncrementalHFiles

2018-01-10 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-11409.

Resolution: Fixed

> Add more flexibility for input directory structure to LoadIncrementalHFiles
> ---
>
> Key: HBASE-11409
> URL: https://issues.apache.org/jira/browse/HBASE-11409
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: churro morales
>Assignee: churro morales
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-11409.v1.patch, HBASE-11409.v2.patch, 
> HBASE-11409.v3.patch, HBASE-11409.v4.patch, HBASE-11409.v5.patch
>
>
> Use case:
> We were trying to combine two very large tables into a single table.  Thus we 
> ran jobs in one datacenter that populated certain column families and another 
> datacenter which populated other column families.  Took a snapshot and 
> exported them to their respective datacenters.  Wanted to simply take the 
> hdfs restored snapshot and use LoadIncremental to merge the data.  
> It would be nice to add support where we could run LoadIncremental on a 
> directory where the depth of store files is something other than two (current 
> behavior).  
> With snapshots it would be nice if you could pass a restored hdfs snapshot's 
> directory and have the tool run.  
> I am attaching a patch where I parameterize the bulkLoad timeout as well as 
> the default store file depth.  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HBASE-19754) Backport HBASE-11409 to branch-1 and branch-1.4

2018-01-10 Thread churro morales (JIRA)
churro morales created HBASE-19754:
--

 Summary: Backport HBASE-11409 to branch-1 and branch-1.4
 Key: HBASE-19754
 URL: https://issues.apache.org/jira/browse/HBASE-19754
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.4.0, 1.5.0
Reporter: churro morales
Assignee: churro morales
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HBASE-19528) Major Compaction Tool

2017-12-15 Thread churro morales (JIRA)
churro morales created HBASE-19528:
--

 Summary: Major Compaction Tool 
 Key: HBASE-19528
 URL: https://issues.apache.org/jira/browse/HBASE-19528
 Project: HBase
  Issue Type: New Feature
Reporter: churro morales
Assignee: churro morales
 Fix For: 2.0.0, 3.0.0


The basic overview of how this tool works is:

Parameters:

Table

Stores

ClusterConcurrency

Timestamp


So you input a table, desired concurrency and the list of stores you wish to 
major compact.  The tool first checks the filesystem to see which stores need 
compaction based on the timestamp you provide (default is current time).  It 
takes that list of stores that require compaction and executes those requests 
concurrently with at most N distinct RegionServers compacting at a given time.  
Each thread waits for the compaction to complete before moving to the next 
queue.  If a region split, merge or move happens this tool ensures those 
regions get major compacted as well. 

This helps us in two ways: we can limit how much I/O bandwidth we are using for 
major compaction cluster-wide, and we are guaranteed after the tool completes 
that all requested compactions complete regardless of moves, merges and splits. 
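
A rough sketch of the loop described above, assuming the HBase 2.x Admin API; 
findStoresNeedingCompaction() is a hypothetical placeholder for the filesystem 
check, and the thread pool caps concurrent compaction requests rather than 
distinct regionservers:

{code}
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.CompactionState;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class MajorCompactionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    int clusterConcurrency = 5;  // at most N compactions in flight at once
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Hypothetical helper: region names whose stores have files older than
      // the supplied timestamp (default: current time).
      List<byte[]> regions = findStoresNeedingCompaction(conf, System.currentTimeMillis());
      ExecutorService pool = Executors.newFixedThreadPool(clusterConcurrency);
      for (byte[] region : regions) {
        pool.submit(() -> {
          try {
            admin.majorCompactRegion(region);
            // Wait for completion before this worker picks up the next region.
            while (admin.getCompactionStateForRegion(region) != CompactionState.NONE) {
              Thread.sleep(10_000);
            }
          } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(7, TimeUnit.DAYS);
    }
  }

  private static List<byte[]> findStoresNeedingCompaction(Configuration conf, long ts) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }
}
{code}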



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HBASE-19405) Fix RowTooBigException to use that in hbase-client and ensure that it extends DoNotRetryIOException

2017-12-01 Thread churro morales (JIRA)
churro morales created HBASE-19405:
--

 Summary: Fix RowTooBigException to use that in hbase-client and 
ensure that it extends DoNotRetryIOException
 Key: HBASE-19405
 URL: https://issues.apache.org/jira/browse/HBASE-19405
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.0, 3.0.0, 1.4.0
Reporter: churro morales
Assignee: churro morales


Looks like this is very different between branches.  
In master the client-side exception extends the correct exception but it is not 
thrown from the StoreScanner.
Looking quickly at 1.4 it does not look to extend the correct exception and it 
is not thrown from anywhere. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (HBASE-18253) Ability to isolate regions on regionservers through hbase shell

2017-06-21 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-18253.

Resolution: Not A Problem

> Ability to isolate regions on regionservers through hbase shell
> ---
>
> Key: HBASE-18253
> URL: https://issues.apache.org/jira/browse/HBASE-18253
> Project: HBase
>  Issue Type: Task
>Affects Versions: 2.0.0-alpha-1
>Reporter: churro morales
>Assignee: Chinmay Kulkarni
>Priority: Minor
>
> Now that we have the ability to put regionservers in draining mode through 
> the hbase shell, another tool that would be nice is the ability to isolate 
> certain regions from others (temporarily - like META), via a shell command 
> of the form:
> shell> isolate_regions '', '', ''.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HBASE-18253) Ability to isolate regions on regionservers through hbase shell

2017-06-21 Thread churro morales (JIRA)
churro morales created HBASE-18253:
--

 Summary: Ability to isolate regions on regionservers through hbase 
shell
 Key: HBASE-18253
 URL: https://issues.apache.org/jira/browse/HBASE-18253
 Project: HBase
  Issue Type: Task
Affects Versions: 2.0.0-alpha-1
Reporter: churro morales
Assignee: Chinmay Kulkarni
Priority: Minor


Now that we have the ability to put regionservers in draining mode through the 
hbase shell, another tool that would be nice is the ability to isolate certain 
regions from others (temporarily - like META), via a shell command of the 
form:

shell> isolate_regions '', '', ''.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HBASE-17965) Canary tool should print the regionserver name on failure

2017-04-26 Thread churro morales (JIRA)
churro morales created HBASE-17965:
--

 Summary: Canary tool should print the regionserver name on failure
 Key: HBASE-17965
 URL: https://issues.apache.org/jira/browse/HBASE-17965
 Project: HBase
  Issue Type: Task
Reporter: churro morales
Assignee: Karan Mehta
Priority: Minor


It would be nice when we have a canary failure for a region to print the 
associated regionserver's name in the log as well. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17762) Add logging to HBaseAdmin for user initiated tasks

2017-03-08 Thread churro morales (JIRA)
churro morales created HBASE-17762:
--

 Summary: Add logging to HBaseAdmin for user initiated tasks
 Key: HBASE-17762
 URL: https://issues.apache.org/jira/browse/HBASE-17762
 Project: HBase
  Issue Type: Task
Reporter: churro morales
Assignee: churro morales
 Fix For: 2.0.0, 1.4.0, 0.98.25


Things like auditing a forced major compaction are really useful and right now 
there is no logging when this is triggered.  Other actions may require logging 
as well. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17698) ReplicationEndpoint choosing sinks

2017-02-24 Thread churro morales (JIRA)
churro morales created HBASE-17698:
--

 Summary: ReplicationEndpoint choosing sinks
 Key: HBASE-17698
 URL: https://issues.apache.org/jira/browse/HBASE-17698
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.0, 1.4.0
Reporter: churro morales


The only time we choose new sinks is when we have a ConnectException, but we 
have encountered other exceptions where there is a problem contacting a 
particular sink and replication gets backed up for any sources that try that 
sink.

HBASE-17675 occurred when there was a bad keytab refresh and the source was 
stuck.

Another issue we recently had was a bad drive controller on the sink side and 
replication was stuck again.  

Is there any reason not to choose new sinks anytime we have a RemoteException?  
I can understand that for TableNotFound we don't have to choose new sinks, but 
for all other cases this seems like the safest approach.  





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17675) ReplicationEndpoint should choose new sinks if a SaslException occurs

2017-02-21 Thread churro morales (JIRA)
churro morales created HBASE-17675:
--

 Summary: ReplicationEndpoint should choose new sinks if a 
SaslException occurs 
 Key: HBASE-17675
 URL: https://issues.apache.org/jira/browse/HBASE-17675
 Project: HBase
  Issue Type: Bug
Reporter: churro morales


We had an issue where a regionserver on our destination side failed to refresh 
its keytabs.  The source side's replication got stuck because the 
HBaseInterClusterReplicationEndpoint only chooses new sinks if there happens to 
be a ConnectException, but a SaslException is just an IOException, which does 
not trigger choosing new sinks.  

I'll put up a patch to check this exception and choose new sinks.
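
A minimal sketch of the proposed check; chooseNewSinks() is a hypothetical 
stand-in for the endpoint's sink re-selection logic:

{code}
import java.io.IOException;
import java.net.ConnectException;
import javax.security.sasl.SaslException;

public class SinkSelectionSketch {
  interface SinkManager { void chooseNewSinks(); }  // hypothetical

  static void handleReplicationFailure(IOException e, SinkManager sinks) {
    // Today only ConnectException triggers re-selection; the proposal is to
    // also re-select when the failure is a SaslException (e.g. a bad keytab).
    if (e instanceof ConnectException || e instanceof SaslException) {
      sinks.chooseNewSinks();
    }
  }
}
{code}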





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17609) Allow for region merging in the UI

2017-02-07 Thread churro morales (JIRA)
churro morales created HBASE-17609:
--

 Summary: Allow for region merging in the UI 
 Key: HBASE-17609
 URL: https://issues.apache.org/jira/browse/HBASE-17609
 Project: HBase
  Issue Type: Task
Affects Versions: 2.0.0, 1.4.0
Reporter: churro morales
Assignee: churro morales


HBASE-49 discussed having the ability to merge regions through the HBase UI, 
but online region merging wasn't around back then. 

I have created additional form fields for the table.jsp where you can pass in 
two encoded region names (must be adjacent regions) and a merge can be called 
through the UI. 
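
For reference, a minimal sketch of the Admin call such a UI handler would end 
up making, assuming the HBase 1.x client API and two user-supplied encoded 
region names:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class MergeRegionsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      byte[] regionA = Bytes.toBytes(args[0]);      // encoded name of first region
      byte[] regionB = Bytes.toBytes(args[1]);      // encoded name of adjacent region
      admin.mergeRegions(regionA, regionB, false);  // forcible=false: regions must be adjacent
    }
  }
}
{code}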



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-16710) Add ZStandard Codec to Compression.java

2016-09-26 Thread churro morales (JIRA)
churro morales created HBASE-16710:
--

 Summary: Add ZStandard Codec to Compression.java
 Key: HBASE-16710
 URL: https://issues.apache.org/jira/browse/HBASE-16710
 Project: HBase
  Issue Type: Task
Affects Versions: 2.0.0
Reporter: churro morales
Assignee: churro morales
Priority: Minor


HADOOP-13578 is adding the ZStandardCodec to hadoop.  This is a placeholder to 
ensure it gets added to hbase once this gets upstream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-16086) TableCfWALEntryFilter and ScopeWALEntryFilter should not redundantly iterate over cells.

2016-06-22 Thread churro morales (JIRA)
churro morales created HBASE-16086:
--

 Summary: TableCfWALEntryFilter and ScopeWALEntryFilter should not 
redundantly iterate over cells.
 Key: HBASE-16086
 URL: https://issues.apache.org/jira/browse/HBASE-16086
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: churro morales


TableCfWALEntryFilter and ScopeWALEntryFilter both filter by iterating over 
cells.  Since the filters are chained we do this work twice.  Instead iterate 
over cells once and apply the "cell filtering" logic to these cells.
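
A minimal sketch of the single-pass idea; the two predicates stand in for the 
table-cf and scope checks and are not the actual filter code:

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

import org.apache.hadoop.hbase.Cell;

public class SinglePassCellFilterSketch {
  static List<Cell> filterCells(List<Cell> cells,
                                Predicate<Cell> tableCfKeep,
                                Predicate<Cell> scopeKeep) {
    List<Cell> kept = new ArrayList<>();
    for (Cell cell : cells) {  // iterate the entry's cells exactly once
      if (tableCfKeep.test(cell) && scopeKeep.test(cell)) {
        kept.add(cell);
      }
    }
    return kept;
  }
}
{code}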



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-15816) Provide client with ability to set priority on Operations

2016-05-11 Thread churro morales (JIRA)
churro morales created HBASE-15816:
--

 Summary: Provide client with ability to set priority on Operations 
 Key: HBASE-15816
 URL: https://issues.apache.org/jira/browse/HBASE-15816
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: churro morales
Assignee: churro morales


First round will just be to expose the ability to set priorities for client 
operations.  For more background: 
http://mail-archives.apache.org/mod_mbox/hbase-dev/201604.mbox/%3CCA+RK=_BG_o=q8HMptcP2WauAinmEsL+15f3YEJuz=qbpcya...@mail.gmail.com%3E

Next step would be to remove AnnotationReadingPriorityFunction and have the 
client send priorities explicitly.  
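
A minimal sketch of what the client-facing API could look like, assuming a 
setPriority(int) method on Get (the exact surface is what this issue would 
define) and the existing HConstants.HIGH_QOS constant; the table and row are 
illustrative:

{code}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OperationPrioritySketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("t"))) {
      Get get = new Get(Bytes.toBytes("row-1"));
      get.setPriority(HConstants.HIGH_QOS);  // proposed: client-chosen priority per operation
      Result result = table.get(get);
      System.out.println(result);
    }
  }
}
{code}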



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-15727) Canary Tool for Zookeeper

2016-04-27 Thread churro morales (JIRA)
churro morales created HBASE-15727:
--

 Summary: Canary Tool for Zookeeper
 Key: HBASE-15727
 URL: https://issues.apache.org/jira/browse/HBASE-15727
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: churro morales
Assignee: churro morales


It would be nice to have the canary tool also monitor zookeeper.  Something 
simple like doing a getData() call on zookeeper.znode.parent.

It would be nice to create clients for every instance in the quorum so that 
you could monitor overloaded or poorly behaving instances.
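
A minimal sketch of such a probe using the plain ZooKeeper client, assuming the 
default /hbase parent znode; the real change would live in the Canary tool:

{code}
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperCanarySketch {
  public static void main(String[] args) throws Exception {
    String quorum = args.length > 0 ? args[0] : "localhost:2181";
    String parentZnode = "/hbase";  // value of zookeeper.znode.parent
    ZooKeeper zk = new ZooKeeper(quorum, 30_000, event -> { });
    try {
      long start = System.currentTimeMillis();
      byte[] data = zk.getData(parentZnode, false, null);  // simple liveness probe
      System.out.println("getData(" + parentZnode + ") returned "
          + (data == null ? 0 : data.length) + " bytes in "
          + (System.currentTimeMillis() - start) + " ms");
    } finally {
      zk.close();
    }
  }
}
{code}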



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HBASE-12814) Zero downtime upgrade from 94 to 98

2016-04-01 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-12814.

Resolution: Not A Problem

Most likely everyone is off the 94 branch.  

> Zero downtime upgrade from 94 to 98 
> 
>
> Key: HBASE-12814
> URL: https://issues.apache.org/jira/browse/HBASE-12814
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 0.94.26, 0.98.10
>Reporter: churro morales
>Assignee: churro morales
> Attachments: HBASE-12814-0.94.patch, HBASE-12814-0.98.patch
>
>
> Here at Flurry we want to upgrade our HBase cluster from 94 to 98 while not 
> having any downtime and maintaining master / master replication. 
> Summary:
> Replication is done via thrift RPC between clusters.  It is configurable on a 
> peer by peer basis and the one caveat is that a thrift server starts up on 
> every node which proxies the request to the ReplicationSink.  
> For the upgrade process:
> * in hbase-site.xml two new configuration parameters are added:
> ** *Required*
> *** hbase.replication.sink.enable.thrift -> true
> *** hbase.replication.thrift.server.port -> 
> ** *Optional*
> *** hbase.replication.thrift.protection {default: AUTHENTICATION}
> *** hbase.replication.thrift.framed {default: false}
> *** hbase.replication.thrift.compact {default: true}
> - All regionservers can be rolling restarted (no downtime), all clusters must 
> have the respective patch for this to work.
> - the hbase shell add_peer command takes an additional parameter for rpc 
> protocol
> - example: {code} add_peer '1' "hbase-101:2181:/hbase", "THRIFT" {code}
> Now comes the fun part: when you want to upgrade your cluster from 94 to 98 
> you simply pause replication to the cluster being upgraded, do the upgrade 
> and un-pause replication.  Once you have a pair of clusters only replicating 
> inbound and outbound with the 98 release, you can start replicating via the 
> native rpc protocol by adding the peer again without the _THRIFT_ parameter 
> and subsequently deleting the peer with the thrift protocol.  Because 
> replication is idempotent I don't see any issues as long as you wait for the 
> backlog to drain after un-pausing replication. 
> Special thanks to Francis Liu at Yahoo for laying the groundwork and Mr. Dave 
> Latham for his invaluable knowledge and assistance.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-15321) Ability to open a HRegion from hdfs snapshot.

2016-02-24 Thread churro morales (JIRA)
churro morales created HBASE-15321:
--

 Summary: Ability to open a HRegion from hdfs snapshot.
 Key: HBASE-15321
 URL: https://issues.apache.org/jira/browse/HBASE-15321
 Project: HBase
  Issue Type: New Feature
Affects Versions: 2.0.0
Reporter: churro morales
 Fix For: 2.0.0


Now that hdfs snapshots are here, we started to run our mapreduce jobs over 
hdfs snapshots.  The thing is, hdfs snapshots are read-only point-in-time 
copies of the file system.  Thus we had to modify the section of code that 
initialized the region internals in HRegion.  We have to skip cleanup of 
certain directories if the HRegion is backed by an hdfs snapshot.  I have a 
patch for trunk with some basic tests if folks are interested.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HBASE-11352) When HMaster starts up it deletes the tmp snapshot directory, if you are exporting a snapshot at that time the job will fail

2016-02-23 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-11352.

Resolution: Not A Problem

In newer versions of HBase you can just select the skipTmp option when 
exporting snapshots; this resolves the issue.

> When HMaster starts up it deletes the tmp snapshot directory, if you are 
> exporting a snapshot at that time the job will fail
> 
>
> Key: HBASE-11352
> URL: https://issues.apache.org/jira/browse/HBASE-11352
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.94.19
>Reporter: churro morales
> Attachments: HBASE-11352-0.94.patch, HBASE-11352-v2.0.94.patch
>
>
> We are exporting a very large table.  The export snapshot job takes 7+ days 
> to complete.  During that time we had to bounce HMaster.  When HMaster 
> initializes, it initializes the SnapshotManager which subsequently deletes 
> the .tmp directory.
> If this happens while the ExportSnapshot job is running the reference files 
> get removed and the job fails.
> Maybe we could put some sort of token such that when this job is running 
> HMaster wont reset the tmp directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HBASE-12889) Add scanner caching and batching options for the CopyTable job.

2016-02-23 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-12889.

Resolution: Won't Fix

> Add scanner caching and batching options for the CopyTable job.
> ---
>
> Key: HBASE-12889
> URL: https://issues.apache.org/jira/browse/HBASE-12889
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 0.98.10, 1.1.0
>Reporter: churro morales
>Assignee: churro morales
>Priority: Minor
> Attachments: HBASE-12889.0.98.patch, HBASE-12889.patch
>
>
> We use the copy table job to ship data between clusters.  Sometimes we have 
> very wide rows and it is nice to be able to set the batching and caching.  
> I'll attach trivial patches for you guys.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HBASE-13031) Ability to snapshot based on a key range

2016-02-23 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-13031.

Resolution: Won't Fix

> Ability to snapshot based on a key range
> 
>
> Key: HBASE-13031
> URL: https://issues.apache.org/jira/browse/HBASE-13031
> Project: HBase
>  Issue Type: Improvement
>Reporter: churro morales
>Assignee: churro morales
> Fix For: 2.0.0, 1.3.0, 0.98.18
>
> Attachments: HBASE-13031-v1.patch, HBASE-13031.patch
>
>
> Posted on the mailing list and seems like some people are interested.  A 
> little background for everyone.
> We have a very large table, we would like to snapshot and transfer the data 
> to another cluster (compressed data is always better to ship).  Our problem 
> lies in the fact it could take many weeks to transfer all of the data and 
> during that time with major compactions, the data stored in dfs has the 
> potential to double which would cause us to run out of disk space.
> So we were thinking about allowing the ability to snapshot a specific key 
> range.  
> Ideally I feel the approach is that the user would specify a start and stop 
> key, those would be associated with a region boundary.  If between the time 
> the user submits the request and the snapshot is taken the boundaries change 
> (due to merging or splitting of regions) the snapshot should fail.
> We would know which regions to snapshot and if those changed between when the 
> request was submitted and the regions locked, the snapshot could simply fail 
> and the user would try again, instead of potentially giving the user more / 
> less than what they had anticipated.  I was planning on storing the start / 
> stop key in the SnapshotDescription and from there it looks pretty 
> straightforward where we just have to change the verifier code to accommodate 
> the key ranges.  
> If this design sounds good to anyone, or if I am overlooking anything please 
> let me know.  Once we agree on the design, I'll write and submit the patches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HBASE-12890) Provide a way to throttle the number of regions moved by the balancer

2016-02-23 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-12890.

Resolution: Won't Fix

> Provide a way to throttle the number of regions moved by the balancer
> -
>
> Key: HBASE-12890
> URL: https://issues.apache.org/jira/browse/HBASE-12890
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 0.98.10
>Reporter: churro morales
>Assignee: churro morales
> Fix For: 2.0.0, 1.3.0, 0.98.18
>
> Attachments: HBASE-12890.patch
>
>
> We have a very large cluster and we frequently add and remove quite a few 
> regionservers from our cluster.  Whenever we do this the balancer moves 
> thousands of regions at once.  Instead we provide a configuration parameter: 
> hbase.balancer.max.regions.  This limits the number of regions that are 
> balanced per iteration.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-15286) Revert the API changes to TimeRange constructor and make IA.Private

2016-02-17 Thread churro morales (JIRA)
churro morales created HBASE-15286:
--

 Summary: Revert the API changes to TimeRange constructor and make 
IA.Private
 Key: HBASE-15286
 URL: https://issues.apache.org/jira/browse/HBASE-15286
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.0, 1.2.0
Reporter: churro morales
Assignee: churro morales


Based on the discussion here: 

https://mail-archives.apache.org/mod_mbox/hbase-dev/201602.mbox/%3ccan5cbe4rs-2tv3rn1-xhaz0yt3kh3+zkg+8ewk_6kbkfkds...@mail.gmail.com%3E





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-15130) Backport 0.98 Scan different TimeRange for each column family

2016-01-19 Thread churro morales (JIRA)
churro morales created HBASE-15130:
--

 Summary: Backport 0.98 Scan different TimeRange for each column 
family 
 Key: HBASE-15130
 URL: https://issues.apache.org/jira/browse/HBASE-15130
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.17
Reporter: churro morales
Assignee: churro morales
 Fix For: 0.98.18


branch 98 version backport for HBASE-14355





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-15067) Rest API should support scan timeRange per column family

2016-01-04 Thread churro morales (JIRA)
churro morales created HBASE-15067:
--

 Summary: Rest API should support scan timeRange per column family
 Key: HBASE-15067
 URL: https://issues.apache.org/jira/browse/HBASE-15067
 Project: HBase
  Issue Type: New Feature
Reporter: churro morales


see discussion in HBASE-14872



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-14872) Scan different timeRange per column family doesn't percolate down to the memstore

2015-11-23 Thread churro morales (JIRA)
churro morales created HBASE-14872:
--

 Summary: Scan different timeRange per column family doesn't 
percolate down to the memstore 
 Key: HBASE-14872
 URL: https://issues.apache.org/jira/browse/HBASE-14872
 Project: HBase
  Issue Type: Bug
  Components: Client, regionserver, Scanners
Affects Versions: 2.0.0, 1.3.0
Reporter: churro morales
Assignee: churro morales
 Fix For: 2.0.0, 1.3.0, 0.98.17


HBASE-14355's scan-different-time-range-per-column-family feature was not 
applied to the memstore; it was only done for the store files.  This breaks the 
contract.
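
For context, a minimal sketch of the HBASE-14355 per-column-family API whose 
memstore behavior this issue fixes; the family name and window are 
illustrative:

{code}
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PerFamilyTimeRangeSketch {
  public static Scan buildScan() {
    Scan scan = new Scan();
    long now = System.currentTimeMillis();
    // Only cells in family "d" written in the last hour; other families keep
    // the scan-wide default time range.  Per this issue, the restriction must
    // also apply to cells still in the memstore, not just to store files.
    scan.setColumnFamilyTimeRange(Bytes.toBytes("d"), now - 3600_000L, now);
    return scan;
  }
}
{code}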



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-14129) If any regionserver gets shutdown uncleanly during full cluster restart, locality looks to be lost

2015-07-21 Thread churro morales (JIRA)
churro morales created HBASE-14129:
--

 Summary: If any regionserver gets shutdown uncleanly during full 
cluster restart, locality looks to be lost
 Key: HBASE-14129
 URL: https://issues.apache.org/jira/browse/HBASE-14129
 Project: HBase
  Issue Type: Bug
Reporter: churro morales


We were doing a cluster restart the other day.  Some regionservers did not shut 
down cleanly.  Upon restart our locality went from 99% to 5%.  Upon looking at 
the AssignmentManager.joinCluster() code it calls 
AssignmentManager.processDeadServersAndRegionsInTransition().
If the failover flag gets set for any reason it seems we don't call 
assignAllUserRegions().  Then it looks like the balancer does the work of 
assigning those regions; we don't use a locality-aware balancer, so we lost our 
region locality.

I don't have a solid grasp on the reasoning for these checks but there could be 
some potential workarounds here.

1. After shutting down your cluster, move your WALs aside (replay later).  
2. Clean up your zNodes 

That seems to work, but requires a lot of manual labor.  Another solution which 
I prefer would be to have a flag for ./start-hbase.sh --clean 

If we start master with that flag then we do a check in 
AssignmentManager.processDeadServersAndRegionsInTransition(): if this flag 
is set we call assignAllUserRegions() regardless of the failover state.

I have a patch for the latter solution, that is, if I am understanding the 
logic correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-13724) ReplicationSource dies under certain conditions reading a sequence file

2015-05-20 Thread churro morales (JIRA)
churro morales created HBASE-13724:
--

 Summary: ReplicationSource dies under certain conditions reading a 
sequence file
 Key: HBASE-13724
 URL: https://issues.apache.org/jira/browse/HBASE-13724
 Project: HBase
  Issue Type: Bug
Reporter: churro morales


A little background, 

We run our server in -ea mode and have seen quite a few replication sources 
silently die over the past few months.

Note: the stacktrace I posted below comes from a regionserver running 0.94 but 
quickly looking at this issue, I believe this will happen in 98 too.  

Should we harden the replication source to deal with these types of assertion 
errors by catching throwables, or should we be dealing with this at the 
sequence file reader level?  We are still looking into the root cause of this 
issue, but when we manually shut down our regionservers, the regionserver that 
recovered the queue replicated that log just fine.  So in our case a simple 
retry would have worked.  

{code}
2015-05-08 11:04:23,348 ERROR 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unexpected 
exception in ReplicationSource, 
currentPath=hdfs://hm6.xxx.flurry.com:9000/hbase/.logs/x.yy.flurry.com,60020,1426792702998/x.atl.flurry.com%2C60020%2C1426792702998.1431107922449
java.lang.AssertionError
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader$WALReaderFSDataInputStream.getPos(SequenceFileLogReader.java:121)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1489)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1479)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1474)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.init(SequenceFileLogReader.java:55)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:178)
at 
org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:734)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:69)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:583)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:373)
{code}
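
A minimal sketch of the retry-instead-of-die idea; openReader() and the attempt 
limit are hypothetical stand-ins for the ReplicationSource internals:

{code}
public class ReplicationReaderRetrySketch {
  interface ReaderOpener { void openReader(String walPath) throws Exception; }  // hypothetical

  static boolean openWithRetries(ReaderOpener opener, String walPath, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        opener.openReader(walPath);
        return true;
      } catch (Throwable t) {  // also catches AssertionError from -ea runs
        System.err.println("Failed to open " + walPath + " (attempt " + attempt + "): " + t);
      }
    }
    return false;  // give up only after exhausting the retries
  }
}
{code}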



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HBASE-13042) MR Job to export HFiles directly from an online cluster

2015-04-23 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-13042.

Resolution: Fixed

 MR Job to export HFiles directly from an online cluster
 ---

 Key: HBASE-13042
 URL: https://issues.apache.org/jira/browse/HBASE-13042
 Project: HBase
  Issue Type: New Feature
Reporter: Dave Latham

 We're looking at the best way to bootstrap a new remote cluster.  The source 
 cluster has a large table of compressed data using more than 50% of the 
 HDFS capacity and we have a WAN link to the remote cluster.  Ideally we would 
 set up replication to a new table remotely, snapshot the source table, copy 
 the snapshot across, then bulk load it into the new table.  However the 
 amount of time to copy the data remotely is greater than the major compaction 
 interval so the source cluster would run out of storage.
 One approach is HBASE-13031 to allow the operators to snapshot and copy one 
 key range at a time.  Here's another idea:
 Create a MR job that tries to do a robust remote HFile copy directly:
  * Each split is responsible for a key range.
  * Map task looks up that key range and maps it to a set of HDFS store 
 directories (one for each region/family)
  * For each store:
** List HFiles in store (needs to be less than 1000 files to guarantee 
 atomic listing)
** Attempt to copy store files (copy in increasing size order to minimize 
 likelihood of compaction removing a file during copy)
** If some of the files disappear (compaction), retry directory list / copy
  * If any of the stores disappear (region split / merge) then retry map task 
 (and remap key range to stores)
 Or maybe there are some HBase locking mechanisms for a region or store that 
 would be better.  Otherwise the question is how often would compactions or 
 region splits force retries.
 Is this crazy? 
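
A minimal sketch of the per-store copy step described above, using plain Hadoop 
FileSystem/FileUtil calls; re-listing on FileNotFoundException approximates the 
"retry on compaction" behavior, and a missing store directory propagates so the 
map task can be retried:

{code}
import java.io.FileNotFoundException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class StoreCopySketch {
  static void copyStore(FileSystem srcFs, Path storeDir, FileSystem dstFs, Path dstDir,
                        Configuration conf) throws Exception {
    while (true) {
      FileStatus[] files = srcFs.listStatus(storeDir);
      // Copy in increasing size order to minimize the window in which a
      // compaction can remove a file we have not copied yet.
      Arrays.sort(files, Comparator.comparingLong(FileStatus::getLen));
      try {
        for (FileStatus f : files) {
          FileUtil.copy(srcFs, f.getPath(), dstFs, new Path(dstDir, f.getPath().getName()),
              false, conf);
        }
        return;  // whole store copied
      } catch (FileNotFoundException e) {
        // A compaction removed a file mid-copy: re-list the store and retry.
      }
    }
  }
}
{code}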



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-13459) A more robust Verify Replication

2015-04-13 Thread churro morales (JIRA)
churro morales created HBASE-13459:
--

 Summary: A more robust Verify Replication 
 Key: HBASE-13459
 URL: https://issues.apache.org/jira/browse/HBASE-13459
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.98.12, 2.0.0, 1.0.1
Reporter: churro morales
Assignee: churro morales
Priority: Minor


We have done quite a bit of data center migration work in the past year.  We 
modified verify replication a bit to help us out.

Things like:
Ignoring timestamps when comparing Cells
More detailed counters when discrepancies are reported between rows; added the 
following counters: 
SOURCEMISSINGROWS, TARGETMISSINGROWS, SOURCEMISSINGKEYS, TARGETMISSINGKEYS
Also added the ability to run this job on any pair of tables and clusters.

If folks are interested I can put up the patch and backport.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-13043) Backport HBASE-11436 to 94 branch

2015-02-13 Thread churro morales (JIRA)
churro morales created HBASE-13043:
--

 Summary: Backport HBASE-11436 to 94 branch
 Key: HBASE-13043
 URL: https://issues.apache.org/jira/browse/HBASE-13043
 Project: HBase
  Issue Type: Task
Reporter: churro morales
Assignee: churro morales


it would be nice to be able to specify key ranges for the export job in 94  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-13031) Ability to snapshot based on a key range

2015-02-12 Thread churro morales (JIRA)
churro morales created HBASE-13031:
--

 Summary: Ability to snapshot based on a key range
 Key: HBASE-13031
 URL: https://issues.apache.org/jira/browse/HBASE-13031
 Project: HBase
  Issue Type: Brainstorming
Affects Versions: 0.94.26, 2.0.0, 1.1.0, 0.98.11
Reporter: churro morales
Assignee: churro morales
Priority: Critical


Posted on the mailing list and seems like some people are interested.  A little 
background for everyone.

We have a very large table, we would like to snapshot and transfer the data to 
another cluster (compressed data is always better to ship).  Our problem lies 
in the fact it could take many weeks to transfer all of the data and during 
that time with major compactions, the data stored in dfs has the potential to 
double which would cause us to run out of disk space.

So we were thinking about allowing the ability to snapshot a specific key 
range.  

Ideally I feel the approach is that the user would specify a start and stop 
key, those would be associated with a region boundary.  If between the time the 
user submits the request and the snapshot is taken the boundaries change (due 
to merging or splitting of regions) the snapshot should fail.

We would know which regions to snapshot and if those changed between when the 
request was submitted and the regions locked, the snapshot could simply fail 
and the user would try again, instead of potentially giving the user more / 
less than what they had anticipated.  I was planning on storing the start / 
stop key in the SnapshotDescription and from there it looks pretty 
straightforward where we just have to change the verifier code to accommodate 
the key ranges.  

If this design sounds good to anyone, or if I am overlooking anything please 
let me know.  Once we agree on the design, I'll write and submit the patches.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-13033) Max allowed memstore size should be 80% not 90%

2015-02-12 Thread churro morales (JIRA)
churro morales created HBASE-13033:
--

 Summary: Max allowed memstore size should be 80% not 90% 
 Key: HBASE-13033
 URL: https://issues.apache.org/jira/browse/HBASE-13033
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.11
Reporter: churro morales
Assignee: churro morales
Priority: Minor


Currently in MemstoreFlusher the check for maximum allowed memstore size is set 
to 90% and it should be 80%



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HBASE-13033) Max allowed memstore size should be 80% not 90%

2015-02-12 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-13033.

Resolution: Invalid

 Max allowed memstore size should be 80% not 90% 
 

 Key: HBASE-13033
 URL: https://issues.apache.org/jira/browse/HBASE-13033
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.11
Reporter: churro morales
Assignee: churro morales
Priority: Minor

 Currently in MemstoreFlusher the check for maximum allowed memstore size is 
 set to 90% and it should be 80%



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-12897) Minimum memstore size is a percentage

2015-01-21 Thread churro morales (JIRA)
churro morales created HBASE-12897:
--

 Summary: Minimum memstore size is a percentage
 Key: HBASE-12897
 URL: https://issues.apache.org/jira/browse/HBASE-12897
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.0, 0.98.10, 1.1.0
Reporter: churro morales
Assignee: churro morales


We have a cluster which is optimized for random reads.  Thus we have a large 
block cache and a small memstore.  Currently our heap is 20GB and we wanted to 
configure the memstore to take 4% or 800MB.  Right now the minimum memstore 
size is 5%.  What do you guys think about reducing the minimum size to 1%?  
Suppose we log a warning if the memstore is below 5% but allow it?

What do you folks think? 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-12890) Provide a way to throttle the number of regions moved by the balancer

2015-01-20 Thread churro morales (JIRA)
churro morales created HBASE-12890:
--

 Summary: Provide a way to throttle the number of regions moved by 
the balancer
 Key: HBASE-12890
 URL: https://issues.apache.org/jira/browse/HBASE-12890
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.0.0, 0.98.10, 1.1.0
Reporter: churro morales
Assignee: churro morales


We have a very large cluster and we frequently add and remove quite a few 
regionservers from our cluster.  Whenever we do this the balancer moves 
thousands of regions at once.  Instead we provide a configuration parameter: 
hbase.balancer.max.regions.  This limits the number of regions that are 
balanced per iteration.  
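
A minimal sketch of the proposed throttle, assuming the plan list is simply 
capped before it is handed back; the config name comes from the description 
above:

{code}
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.master.RegionPlan;

public class BalancerThrottleSketch {
  static List<RegionPlan> throttle(List<RegionPlan> plans, Configuration conf) {
    int max = conf.getInt("hbase.balancer.max.regions", Integer.MAX_VALUE);
    // Only move up to `max` regions per balancer iteration.
    return plans.size() > max ? plans.subList(0, max) : plans;
  }
}
{code}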



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-12889) Add scanner caching and batching options for the CopyTable job.

2015-01-20 Thread churro morales (JIRA)
churro morales created HBASE-12889:
--

 Summary: Add scanner caching and batching options for the 
CopyTable job.
 Key: HBASE-12889
 URL: https://issues.apache.org/jira/browse/HBASE-12889
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.0.0, 0.98.10, 1.1.0
Reporter: churro morales
Assignee: churro morales
Priority: Minor


We use the copy table job to ship data between clusters.  Sometimes we have 
very wide rows and it is nice to be able to set the batching and caching.  I'll 
attach trivial patches for you guys.
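
A minimal sketch of the scan settings these options would control; the actual 
patch would wire them through CopyTable's argument parsing:

{code}
import org.apache.hadoop.hbase.client.Scan;

public class CopyTableScanSketch {
  static Scan configure(Scan scan, int caching, int batch) {
    scan.setCaching(caching);  // rows fetched per RPC
    scan.setBatch(batch);      // cells returned per Result, useful for very wide rows
    return scan;
  }
}
{code}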



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-12891) have hbck do region consistency checks in parallel

2015-01-20 Thread churro morales (JIRA)
churro morales created HBASE-12891:
--

 Summary: have hbck do region consistency checks in parallel
 Key: HBASE-12891
 URL: https://issues.apache.org/jira/browse/HBASE-12891
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.0.0, 0.98.10, 1.1.0
Reporter: churro morales
Assignee: churro morales


We have a lot of regions on our cluster (~500k) and noticed that hbck took quite 
some time in checkAndFixConsistency().  [~davelatham] patched our cluster to do 
this check in parallel to speed things up.  I'll attach the patch.
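
A minimal sketch of fanning the per-region check out to a thread pool; 
checkRegionConsistency() is a hypothetical stand-in for the hbck internals 
being parallelized:

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelConsistencyCheckSketch {
  interface RegionCheck { void checkRegionConsistency(String encodedRegionName) throws Exception; }

  static void checkAll(RegionCheck check, List<String> regions, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<?>> futures = new ArrayList<>();
    for (String region : regions) {
      futures.add(pool.submit(() -> { check.checkRegionConsistency(region); return null; }));
    }
    for (Future<?> f : futures) {
      f.get();  // propagate any per-region failure
    }
    pool.shutdown();
  }
}
{code}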



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-12814) Zero downtime upgrade from 94 to 98 with replication

2015-01-06 Thread churro morales (JIRA)
churro morales created HBASE-12814:
--

 Summary: Zero downtime upgrade from 94 to 98 with replication
 Key: HBASE-12814
 URL: https://issues.apache.org/jira/browse/HBASE-12814
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.94.26, 0.98.10
Reporter: churro morales
Assignee: churro morales


Here at Flurry we want to upgrade our HBase cluster from 94 to 98 while not 
having any downtime and maintaining master / master replication. 

Summary:
Replication is done via thrift RPC between clusters.  It is configurable on a 
peer by peer basis and the one caveat is that a thrift server starts up on 
every node which proxies the request to the ReplicationSink.  


For the upgrade process:
* in hbase-site.xml two new configuration parameters are added:
** *Required*
*** hbase.replication.sink.enable.thrift -> true
*** hbase.replication.thrift.server.port -> thrift_server_port
** *Optional*
*** hbase.replication.thrift.protection {default: AUTHENTICATION}
*** hbase.replication.thrift.framed {default: false}
*** hbase.replication.thrift.compact {default: true}

- All regionservers can be rolling restarted (no downtime), all clusters must 
have the respective patch for this to work.
- the hbase shell add_peer command takes an additional parameter for rpc 
protocol
- example: {code} add_peer '1' "hbase-101:2181:/hbase", "THRIFT" {code}

Now comes the fun part: when you want to upgrade your cluster from 94 to 98 you 
simply pause replication to the cluster being upgraded, do the upgrade and 
un-pause replication.  Once you have a pair of clusters only replicating 
inbound and outbound with the 98 release, you can start replicating via the 
native rpc protocol by adding the peer again without the _THRIFT_ parameter and 
subsequently deleting the peer with the thrift protocol.  Because replication 
is idempotent I don't see any issues as long as you wait for the backlog to 
drain after un-pausing replication. 

Special thanks to Francis Liu at Yahoo for laying the groundwork and Mr. Dave 
Latham for his invaluable knowledge and assistance.  
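
A minimal sketch of the hbase-site.xml settings above expressed through the 
Configuration API (these properties come from the proposed patch, not stock 
HBase); the port value is illustrative:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ThriftReplicationConfigSketch {
  static Configuration build() {
    Configuration conf = HBaseConfiguration.create();
    // Required
    conf.setBoolean("hbase.replication.sink.enable.thrift", true);
    conf.setInt("hbase.replication.thrift.server.port", 9091);  // example port
    // Optional (defaults shown in the description above)
    conf.set("hbase.replication.thrift.protection", "AUTHENTICATION");
    conf.setBoolean("hbase.replication.thrift.framed", false);
    conf.setBoolean("hbase.replication.thrift.compact", true);
    return conf;
  }
}
{code}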




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-11601) Parallelize Snapshot operations for 0.94

2014-07-28 Thread churro morales (JIRA)
churro morales created HBASE-11601:
--

 Summary: Parallelize Snapshot operations for 0.94
 Key: HBASE-11601
 URL: https://issues.apache.org/jira/browse/HBASE-11601
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.94.21
Reporter: churro morales


Although HBASE-11185 exists, it is geared towards the snapshot manifest code.  
We have used snapshots to ship our two largest tables across the country and 
while doing so found a few potential optimizations where doing things in 
parallel helped quite a bit.  I can attach a patch containing changes I've made 
and we can discuss if these are changes worth getting pushed to 0.94.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-11528) The restoreSnapshot operation should delete the rollback snapshot upon a successful restore

2014-07-16 Thread churro morales (JIRA)
churro morales created HBASE-11528:
--

 Summary: The restoreSnapshot operation should delete the rollback 
snapshot upon a successful restore
 Key: HBASE-11528
 URL: https://issues.apache.org/jira/browse/HBASE-11528
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.20
Reporter: churro morales
Assignee: churro morales
Priority: Minor


We take a snapshot: rollbackSnapshot prior to doing a restore such that if 
the restore fails we can revert the table back to its pre-restore state.  If we 
are successful in restoring the table, we should delete the rollbackSnapshot 
when the restoreSnapshot operation successfully completes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-11409) Add more flexibility for input directory structure to LoadIncrementalHFiles

2014-06-24 Thread churro morales (JIRA)
churro morales created HBASE-11409:
--

 Summary: Add more flexibility for input directory structure to 
LoadIncrementalHFiles
 Key: HBASE-11409
 URL: https://issues.apache.org/jira/browse/HBASE-11409
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.20
Reporter: churro morales


Use case:

We were trying to combine two very large tables into a single table.  Thus we 
ran jobs in one datacenter that populated certain column families and another 
datacenter which populated other column families.  Took a snapshot and exported 
them to their respective datacenters.  Wanted to simply take the hdfs restored 
snapshot and use LoadIncremental to merge the data.  

It would be nice to add support where we could run LoadIncremental on a 
directory where the depth of store files is something other than two (current 
behavior).  

With snapshots it would be nice if you could pass a restored hdfs snapshot's 
directory and have the tool run.  

I am attaching a patch where I parameterize the bulkLoad timeout as well as the 
default store file depth.  
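
A minimal sketch of driving the bulk load tool against a restored snapshot 
directory; the two property names are hypothetical placeholders for whatever 
the attached patch calls the timeout and store-file-depth knobs:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.util.ToolRunner;

public class RestoredSnapshotBulkLoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical knobs from the patch: a deeper directory layout (a restored
    // snapshot nests store files deeper than two levels) and a longer timeout.
    conf.setInt("hbase.loadincremental.storefile.depth", 3);           // placeholder name
    conf.setLong("hbase.loadincremental.bulkload.timeout", 600_000L);  // placeholder name
    // args: <restored-snapshot-dir-on-hdfs> <table-name>
    System.exit(ToolRunner.run(conf, new LoadIncrementalHFiles(conf), args));
  }
}
{code}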



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-11360) SnapshotFileCache refresh logic based on modified directory time might be insufficient

2014-06-16 Thread churro morales (JIRA)
churro morales created HBASE-11360:
--

 Summary: SnapshotFileCache refresh logic based on modified 
directory time might be insufficient
 Key: HBASE-11360
 URL: https://issues.apache.org/jira/browse/HBASE-11360
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.19
Reporter: churro morales


Right now we decide whether to refresh the cache based on the lastModified 
timestamp of all the snapshots and those running snapshots which is located 
in the /hbase/.hbase-snapshot/.tmp/snapshot directory

We ran an ExportSnapshot job which takes around 7 minutes between creating the 
directory and copying all the files. 

Thus the modified time for the 
/hbase/.hbase-snapshot/.tmp directory was 7 minutes earlier than the modified 
time of the
/hbase/.hbase-snapshot/.tmp/snapshot directory

Thus the cache refresh happens and doesn't pick up all the files, but it thinks 
it's up to date as the modified time of the .tmp directory never changes.

This is a bug as when the export job starts the cache never contains the files 
for the running snapshot and will fail.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-11352) When HMaster starts up it deletes the tmp snapshot directory, if you are exporting a snapshot at that time the job will fail

2014-06-13 Thread churro morales (JIRA)
churro morales created HBASE-11352:
--

 Summary: When HMaster starts up it deletes the tmp snapshot 
directory, if you are exporting a snapshot at that time the job will fail
 Key: HBASE-11352
 URL: https://issues.apache.org/jira/browse/HBASE-11352
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.19
Reporter: churro morales


We are exporting a very large table.  The export snapshot job takes 7+ days to 
complete.  During that time we had to bounce HMaster.  When HMaster 
initializes, it initializes the SnapshotManager which subsequently deletes the 
.tmp directory.

If this happens while the ExportSnapshot job is running the reference files get 
removed and the job fails.

Maybe we could put some sort of token such that when this job is running 
HMaster wont reset the tmp directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-11322) SnapshotHFileCleaner makes the wrong check for lastModified time thus causing too many cache refreshes

2014-06-10 Thread churro morales (JIRA)
churro morales created HBASE-11322:
--

 Summary: SnapshotHFileCleaner makes the wrong check for 
lastModified time thus causing too many cache refreshes
 Key: HBASE-11322
 URL: https://issues.apache.org/jira/browse/HBASE-11322
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.19
Reporter: churro morales
Assignee: churro morales
Priority: Critical


In the SnapshotFileCache:
The last modified time is done via this operation:
{code}
this.lastModifiedTime = Math.min(dirStatus.getModificationTime(),
 tempStatus.getModificationTime());
{code}

and the check to see if the snapshot directories have been modified:
{code}
// if the snapshot directory wasn't modified since we last check, we are done
if (dirStatus.getModificationTime() <= lastModifiedTime &&
    tempStatus.getModificationTime() <= lastModifiedTime) {
  return;
}
{code}

so if the dirStatus and tmpStatus are modified at different times, we will 
always assume they have been modified and refresh the cache.

In our cluster, this was a huge performance hit.  The cleaner chain fell 
behind, thus almost filling up dfs and our namenode heap.

It's a simple fix: instead of Math.min we use Math.max for the lastModified; I 
believe that will be correct.

I'll apply a patch for you guys.
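
A sketch of the one-line change described above:

{code}
// Proposed fix: take the newer of the two modification times so that a stale
// .tmp directory cannot mask later changes when deciding whether to refresh.
this.lastModifiedTime = Math.max(dirStatus.getModificationTime(),
    tempStatus.getModificationTime());
{code}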





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-11195) Potentially improve block locality during major compaction for old regions

2014-05-16 Thread churro morales (JIRA)
churro morales created HBASE-11195:
--

 Summary: Potentially improve block locality during major 
compaction for old regions
 Key: HBASE-11195
 URL: https://issues.apache.org/jira/browse/HBASE-11195
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.94.19
Reporter: churro morales


This might be a specific use case.  But we have some regions which are no 
longer written to (due to the key).  Those regions have 1 store file and they 
are very old, they haven't been written to in a while.  We still use these 
regions to read from so locality would be nice.  

I propose putting a configuration option: something like
hbase.hstore.min.locality.to.skip.major.compact [between 0 and 1]

such that you can decide whether or not to skip major compaction for an old 
region with a single store file.

I'll attach a patch, let me know what you guys think.
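
A minimal sketch of the proposed check, assuming the store's file count and 
locality index are computed elsewhere; the config name comes from the 
description above and the age cutoff is illustrative:

{code}
import org.apache.hadoop.conf.Configuration;

public class SkipMajorCompactionSketch {
  static boolean shouldSkipMajorCompaction(Configuration conf, int storeFileCount,
                                           float localityIndex, long lastWriteAgeMs) {
    float minLocality =
        conf.getFloat("hbase.hstore.min.locality.to.skip.major.compact", 0f);
    // Old, single-file stores that already have good locality gain nothing
    // from being rewritten by a major compaction.
    return storeFileCount == 1
        && lastWriteAgeMs > 7L * 24 * 3600 * 1000  // illustrative "old region" cutoff
        && localityIndex >= minLocality;
  }
}
{code}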



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-10528) DefaultBalancer selects plans to move regions onto draining nodes

2014-02-13 Thread churro morales (JIRA)
churro morales created HBASE-10528:
--

 Summary: DefaultBalancer selects plans to move regions onto 
draining nodes
 Key: HBASE-10528
 URL: https://issues.apache.org/jira/browse/HBASE-10528
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.5
Reporter: churro morales


We have quite a large cluster (100k+ regions), and we needed to isolate a region 
that was very hot until we could push a patch.  We put this region on its own 
regionserver and set it in the draining state.  The default balancer was 
selecting regions to move to this regionserver for its region plans.  

It just so happened that there were very small regions on the draining servers, 
which constantly triggered balancing.  Thus we were closing regions, then 
attempting to move them to the draining server, only to find out it's draining.

There are some approaches we can take here.

1. Exclude draining servers altogether, don't even pass those into the load 
balancer from HMaster.

2. We could exclude draining servers from ceiling and floor calculations where 
we could potentially skip load balancing because those draining servers won't be 
represented when deciding whether to balance.

3. Along with #2 when assigning regions, we would skip plans to assign regions 
to those draining servers.

I am in favor of #1, which simply removes servers as candidates for balancing 
if they are in the draining state.

But I would love to hear what everyone else thinks.
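
A minimal sketch of approach #1: drop draining servers from the candidate list 
before the balancer ever sees them; where the two lists come from is left to 
the HMaster plumbing:

{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.ServerName;

public class ExcludeDrainingServersSketch {
  static List<ServerName> balancerCandidates(List<ServerName> onlineServers,
                                             List<ServerName> drainingServers) {
    List<ServerName> candidates = new ArrayList<>(onlineServers);
    candidates.removeAll(drainingServers);  // never hand draining servers to the balancer
    return candidates;
  }
}
{code}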



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (HBASE-10133) ReplicationSource currentNbOperations overflows

2013-12-12 Thread churro morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

churro morales resolved HBASE-10133.


Resolution: Invalid

 ReplicationSource currentNbOperations overflows 
 

 Key: HBASE-10133
 URL: https://issues.apache.org/jira/browse/HBASE-10133
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.95.0, 0.96.0, 0.94.14
Reporter: churro morales
Priority: Minor

 Noticed in the logs we had lines like this: 
 2013-12-11 00:02:00,343 DEBUG 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 
 currentNbOperations:-1341767084 and seenEntries:0 and size: 0
 Maybe this value should be reset after we ship our edits.  Either that or 
 convert from an int to a long.  
 As this is a jmx metric I feel its important to get this correct.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (HBASE-10133) ReplicationSource currentNbOperations overflows

2013-12-11 Thread churro morales (JIRA)
churro morales created HBASE-10133:
--

 Summary: ReplicationSource currentNbOperations overflows 
 Key: HBASE-10133
 URL: https://issues.apache.org/jira/browse/HBASE-10133
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.14, 0.96.0, 0.95.0
Reporter: churro morales
Priority: Minor


Noticed in the logs we had lines like this: 

2013-12-11 00:02:00,343 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 
currentNbOperations:-1341767084 and seenEntries:0 and size: 0

Maybe this value should be reset after we ship our edits.  Either that or 
convert from an int to a long.  
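
For illustration, the negative value above is what a 32-bit counter looks like 
once it wraps past Integer.MAX_VALUE; both proposed fixes are tiny (field name 
taken from the log line, the declarations are a sketch):

{code}
// Option A: widen the counter.
// was: private int currentNbOperations;
private long currentNbOperations = 0;

// Option B: reset it once a batch of edits has been shipped.
currentNbOperations = 0;
{code}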



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (HBASE-10100) Hbase replication cluster can have varying peers under certain conditions

2013-12-06 Thread churro morales (JIRA)
churro morales created HBASE-10100:
--

 Summary: Hbase replication cluster can have varying peers under 
certain conditions
 Key: HBASE-10100
 URL: https://issues.apache.org/jira/browse/HBASE-10100
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.96.0, 0.95.0, 0.94.5
Reporter: churro morales


We were trying to replicate HBase data over to a new datacenter recently.  
After we turned on replication and did our copy tables, we noticed that 
verify replication had discrepancies.  

We ran list_peers and it returned both peers: the original datacenter we 
were replicating to and the new datacenter (this was correct).  

When grepping through the logs for a few regionservers we noticed that a few 
regionservers had the following entry in their logs:

2013-09-26 10:55:46,907 ERROR 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
Error while adding a new peer java.net.UnknownHostException: xxx.xxx.flurry.com 
(this was due to a transient dns issue)

Thus a very small subset of our regionservers were not replicating to this new 
cluster while most were. 

We probably don't want to abort if this type of issue comes up; aborting could be 
fatal if someone does an add_peer operation with a typo, which could potentially 
shut down the cluster. 

One solution I can think of is keeping a boolean flag in ReplicationSourceManager 
that tracks whether there was an error adding a peer (errorAddingPeer).  
Then in logPositionAndCleanOldLogs we can do something like:

{code}
if (errorAddingPeer) {
  LOG.error("There was an error adding a peer, logs will not be marked for deletion");
  return;
}
{code}

thus we are not deleting these logs from the queue.  You will notice your 
replication queue rising on certain machines, and you can still replay the logs, 
thus avoiding a lengthy copy table. 
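
The other half of that proposal, setting the flag, might look roughly like this 
(a sketch only; the exact method names in ReplicationSourceManager are illustrative):

{code}
// Remember that a peer failed to initialize so that
// logPositionAndCleanOldLogs() keeps the logs around.
private volatile boolean errorAddingPeer = false;

public void peerAdded(String peerId) {
  try {
    addSource(peerId);
  } catch (IOException e) {
    LOG.error("Error while adding a new peer", e);
    errorAddingPeer = true;
  }
}
{code}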

I have a patch (with unit test) for the above proposal, if everyone thinks that 
is an okay solution.

An additional idea would be to add some retry logic inside the PeersWatcher 
class for the nodeChildrenChanged method.  That way, if there happens to be a 
transient issue, we could sort it out without having to bounce that particular 
regionserver.  
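
If the retry route were taken instead, a hedged sketch might be a simple bounded 
retry around the peer setup (method names and the attempt count are illustrative):

{code}
boolean added = false;
for (int attempt = 1; attempt <= 3 && !added; attempt++) {
  try {
    addSource(peerId);   // whatever peer-setup call failed with UnknownHostException
    added = true;
  } catch (IOException e) {
    LOG.warn("Attempt " + attempt + " to add peer " + peerId + " failed, will retry", e);
  }
}
{code}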

Would love to hear everyone's thoughts.









--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HBASE-9865) WALEdit.heapSize() is incorrect in certain replication scenarios which may cause RegionServers to go OOM

2013-10-30 Thread churro morales (JIRA)
churro morales created HBASE-9865:
-

 Summary: WALEdit.heapSize() is incorrect in certain replication 
scenarios which may cause RegionServers to go OOM
 Key: HBASE-9865
 URL: https://issues.apache.org/jira/browse/HBASE-9865
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.95.0, 0.94.5
Reporter: churro morales


WALEdit.heapSize() is incorrect in certain replication scenarios which may 
cause RegionServers to go OOM.

A little background on this issue: we noticed that our source replication 
regionservers would get into GC storms and sometimes even OOM. 
In one case there were around 25k WALEdits to replicate, each one with an 
ArrayList of KeyValues.  The ArrayList had a capacity of around 90k (using 
350KB of heap memory) but only around 6 non-null entries.

When ReplicationSource.readAllEntriesToReplicateOrNextFile() gets a WALEdit, 
it removes all KVs that are scoped other than local.  

But in doing so we don't account for the capacity of the ArrayList when 
determining heapSize for a WALEdit.  The logic for shipping a batch is based on 
whether you have hit a size limit or a number-of-entries limit.  

Therefore, suppose we have a WALEdit with 25k entries and all of them are removed: 
the size of the ArrayList is 0 (we don't even count the collection's own heap size 
currently) but the capacity is ignored.
This will yield a heapSize() of 0 bytes, while in the best case it would be at 
least 10 bytes (provided you pass initialCapacity and you have a 32-bit JVM).

I have some ideas on how to address this problem and want to know everyone's 
thoughts:

1. We use a probabilistic counter such as HyperLogLog and create something like:
* class CapacityEstimateArrayList implements ArrayList
** this class overrides all additive methods to update the probabilistic counts
** it includes one additional method called estimateCapacity (we would take estimateCapacity - size() and fill in sizes for all references)
* Then we can do something like this in WALEdit.heapSize:

{code}
public long heapSize() {
  long ret = ClassSize.ARRAYLIST;
  for (KeyValue kv : kvs) {
    ret += kv.heapSize();
  }
  long nullEntriesEstimate = kvs.getCapacityEstimate() - kvs.size();
  ret += ClassSize.align(nullEntriesEstimate * ClassSize.REFERENCE);
  if (scopes != null) {
    ret += ClassSize.TREEMAP;
    ret += ClassSize.align(scopes.size() * ClassSize.MAP_ENTRY);
    // TODO this isn't quite right, need help here
  }
  return ret;
}
{code}

2. In ReplicationSource.removeNonReplicableEdits() we know the size of the 
array originally, and we provide some percentage threshold.  When that 
threshold is met (50% of the entries have been removed) we can call 
kvs.trimToSize()
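
A small sketch of what idea #2 might look like inside removeNonReplicableEdits(), 
assuming the list really is an ArrayList as it is today (the 50% threshold is just 
an example):

{code}
int originalSize = kvs.size();
// ... the non-replicable KeyValues are removed here ...
if (kvs.size() <= originalSize / 2 && kvs instanceof ArrayList) {
  // Shrink the backing array so unused capacity no longer distorts heapSize().
  ((ArrayList<KeyValue>) kvs).trimToSize();
}
{code}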

3. In the heapSize() method for WALEdit we could use reflection (please don't 
shoot me for this) to grab the actual capacity of the list.  Doing something 
like this:

{code}
public int getArrayListCapacity() {
  try {
    Field f = ArrayList.class.getDeclaredField("elementData");
    f.setAccessible(true);
    return ((Object[]) f.get(kvs)).length;
  } catch (Exception e) {
    log.warn("Exception in trying to get capacity on ArrayList", e);
    return kvs.size();
  }
}
{code}


I am partial to (1), using HyperLogLog and creating a CapacityEstimateArrayList; 
it is reusable throughout the code for other classes that implement HeapSize and 
contain ArrayLists.  The memory footprint is very small and it is very fast.  The 
issue is that this is an estimate: although we can configure the precision, we 
will most likely always be conservative.  The estimateCapacity will always be 
less than the actual capacity, but it will be close.  I think that putting the 
logic in removeNonReplicableEdits will work, but this only solves the heapSize 
problem in this particular scenario.  Solution 3 is slow and horrible, but it 
gives us the exact answer.

I would love to hear if anyone else has any other ideas on how to remedy this 
problem.  I have code for trunk and 0.94 for all 3 ideas and can provide a 
patch if the community thinks any of these approaches is viable.





--
This message was sent by Atlassian JIRA
(v6.1#6144)