Re: removing cells in minor compaction

2017-06-19 Thread Dave Latham
anisms and will disappear after that happens. Doing the GC during minor compactions as well as major ones would change that visibility window, but doesn't seem to change that odd behavior that is there to begin with. On Wed, Jun 14, 2017 at 5:51 PM, Dave Latham wrote: > What cells, if any, a

removing cells in minor compaction

2017-06-14 Thread Dave Latham
What cells, if any, are removed during minor compactions? Cells that (a) are beyond the TTL? (b) are shadowed by a delete marker? (from the files compacted) (c) are shadowed by newer versions? (assuming numVersions configured < num versions of the cell found)
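The three candidate rules in this question can be sketched in a short, language-agnostic model. This is an illustrative simplification, not HBase's compaction code: the cell shape `(row, qualifier, timestamp, is_delete)` and the single-pass pruning are assumptions, and real cells also carry a family, type, and sequence id.

```python
import time

def prune_cells(cells, ttl_seconds, max_versions, now=None):
    """Return the cells a compaction could keep, dropping:
    (a) cells older than the TTL,
    (b) cells shadowed by a delete marker seen in the same file set,
    (c) versions beyond max_versions per (row, qualifier).
    Sketch only -- not the actual HBase compaction logic."""
    now = time.time() if now is None else now
    # Order versions newest-first within each (row, qualifier), as HBase does.
    cells = sorted(cells, key=lambda c: (c[0], c[1], -c[2]))
    kept, versions_seen, delete_ts = [], {}, {}
    for row, qual, ts, is_delete in cells:
        key = (row, qual)
        if is_delete:
            delete_ts[key] = max(delete_ts.get(key, 0), ts)
            # The marker itself survives until a major compaction removes it.
            kept.append((row, qual, ts, True))
            continue
        if now - ts > ttl_seconds:           # (a) beyond the TTL
            continue
        if ts <= delete_ts.get(key, -1):     # (b) shadowed by a delete marker
            continue
        versions_seen[key] = versions_seen.get(key, 0) + 1
        if versions_seen[key] > max_versions:  # (c) excess versions
            continue
        kept.append((row, qual, ts, False))
    return kept
```

Note that rule (b) only sees delete markers present in the files being compacted, which is why a minor compaction cannot safely drop markers themselves.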

Re: All memstores flushed to be quite small files

2017-03-27 Thread Dave Latham
Do you have compression enabled, and is your data highly compressible? On Mon, Mar 27, 2017 at 6:26 AM, Hef wrote: > Hi, > Does anyone have an idea why most of my 128MB memstore flushed files are > only several MBs? > > There are a lot of logs that look like the ones below: > > 2017-03-27 13:10:25,064 INFO > or

Re: Creating HBase table with presplits

2016-11-28 Thread Dave Latham
If you truly have no way to predict anything about the distribution of your data across the row key space, then you are correct that there is no way to presplit your regions in an effective way. Either you need to make some starting guess, such as a small number of uniform splits, or wait until yo

Re: Timestamp of HBase data

2016-04-11 Thread Dave Latham
Hi Zheng, Your intuition is correct. If the client does not specify a timestamp for writes, then the region server will use the system clock to do so. If you send a Put to a region hosted by a server with a clock that is 50 seconds slow, and that region has existing Cell(s) with the same row & c
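The shadowing effect described here follows from reads returning the greatest timestamp per cell. A minimal sketch (the `(timestamp, value)` tuple shape is an assumption for illustration):

```python
def visible_cell(cells):
    """A read of one row/column returns the version with the greatest
    timestamp. A Put stamped by a server whose clock runs slow can land
    'behind' an existing cell and be invisible to subsequent reads.
    Sketch only; cells are (timestamp, value) pairs."""
    return max(cells, key=lambda c: c[0])
```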

Re: why Hbase only split regions in one RegionServer

2016-03-19 Thread Dave Latham
What if someone doesn't know the distribution of their row keys? HBase should be able to handle this case. On Wed, Mar 16, 2016 at 7:18 AM, Jean-Marc Spaggiari < jean-m...@spaggiari.org> wrote: > Balancer is not moving regions that are compacting, right? He is just > pushing too much load on a non

Re: why Hbase only split regions in one RegionServer

2016-03-19 Thread Dave Latham
ying to fix it. > > JMS > > 2016-03-16 10:54 GMT-04:00 Dave Latham : > > > What if someone doesn't know the distribution of their row keys? > > HBase should be able to handle this case. > > > > On Wed, Mar 16, 2016 at 7:18 AM, Jean-Marc Spaggiari < >

Re: hbase timerange scan

2015-11-05 Thread Dave Latham
Don't think that's correct. If you look at StoreFileScanner.shouldUseScanner you can see that it will skip entire store files if the time range for a scan does not intersect with the time range of data in the store file. However, without tiered compaction there is nothing built in to optimize gro
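The skip decision described here is, at heart, an interval-overlap test between the scan's time range and the min/max timestamps recorded in the store file's metadata. A sketch with inclusive bounds for simplicity (illustrative only, not the real `StoreFileScanner.shouldUseScanner` code):

```python
def should_use_store_file(scan_min_ts, scan_max_ts, file_min_ts, file_max_ts):
    """A store file can be skipped entirely when its [min, max] timestamp
    range does not intersect the scan's time range. Inclusive bounds here
    for simplicity; sketch only."""
    return file_min_ts <= scan_max_ts and scan_min_ts <= file_max_ts
```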

Re: scan column families with different time ranges

2015-08-03 Thread Dave Latham
I have not tried out stripe compaction and don't see how it would help here. On Mon, Aug 3, 2015 at 12:16 PM, Ted Yu wrote: > bq. revive some notion of tiered compaction > > Did you have a chance to try out Stripe compaction ? > > Thanks > > On Mon, Aug 3, 2015 at 11

Re: scan column families with different time ranges

2015-08-03 Thread Dave Latham
hether the remaining column families should be > loaded. > > To be specific, if outside the TimeRange you specify (last day), your > > filter returns ReturnCode.INCLUDE_AND_SEEK_NEXT_ROW. > > > > What do you think ? > > > > Cheers > > > > On Sat,

Re: scan column families with different time ranges

2015-08-03 Thread Dave Latham
case) to guide whether the remaining column families should be > loaded. > > To be specific, if outside the TimeRange you specify (last day), your > > filter returns ReturnCode.INCLUDE_AND_SEEK_NEXT_ROW. > > > > What do you think ? > > > > Cheers > > >

Re: scan column families with different time ranges

2015-08-01 Thread Dave Latham
> > branch-1 for upcoming unscheduled minor release line 1.3. Would that > work? > > Or would this change need to go further back? > > > > Maybe someone else has another suggestion. > > > > > > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham wrote: > &g

Re: scan column families with different time ranges

2015-08-01 Thread Dave Latham
n you achieve your goal with two scans ? > The first scan specifies TimeRange corresponding to last day. This scan > returns both column families. > The other scan specifies TimeRange excluding last day. This scan returns > column family A. > > Cheers > > On Sat, Aug 1, 2015 at

Re: scan column families with different time ranges

2015-08-01 Thread Dave Latham
nly family A is returned. > > Cheers > > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham wrote: > > > I have a table with 2 column families, call them A and B, with new data > > regularly being added. They are very different sizes: B is 100x the size > of > > A. Amon

scan column families with different time ranges

2015-08-01 Thread Dave Latham
I have a table with 2 column families, call them A and B, with new data regularly being added. They are very different sizes: B is 100x the size of A. Among other uses for this data, I have a MapReduce job that needs to read all of A, but only recent data from B (e.g. last day). Here are some met

Re: bulk load doubts

2015-07-21 Thread Dave Latham
For #1, as Ted mentioned HDFS replication will work just fine with bulk loads. What you may have read is that bulk loaded data won't be picked up by HBase replication. If you are using HBase replication to send data to another cluster, then you need to also manage getting the bulk loaded data to

Re: Regionservers going down during compaction

2015-07-13 Thread Dave Latham
What JDK are you using? I've seen such behavior when a machine was swapping. Can you tell if there was any swap in use? On Mon, Jul 13, 2015 at 3:24 AM, Ankit Singhal wrote: > Hi Team, > > We are seeing regionservers getting down whenever major compaction is > triggered on table(8.5TB size). >

Re: How to monitor and control heavy network traffic for region server

2015-06-11 Thread Dave Latham
will need to implement on top of >> ResultScanner - ThrottledResultScanner. >> >> Good idea for improvement, actually. >> >> -Vlad >> >> On Wed, Jun 10, 2015 at 7:30 PM, Louis Hust wrote: >> >> > hi, Dave, >> > >> > For now we

Re: How to monitor and control heavy network traffic for region server

2015-06-10 Thread Dave Latham
I'm not aware of anything in version 0.96 that will limit the scan for you - you may have to do it in your client yourself. If you're willing to upgrade, do check out the throttling available in HBase 1.1: https://blogs.apache.org/hbase/entry/the_hbase_request_throttling_feature On Wed, Jun 10,

Re: Issues with import from 0.92 into 0.98

2015-05-27 Thread Dave Latham
On Wed, May 27, 2015 at 11:17 AM, wrote: > Thanks! I want to make sure I've got it right: > > When I import the 0.92 data into 0.98, the columns are defined properly > in the 0.98 table, but I cannot perform a scan with a column filter in > the shell as the shell interprets the second ':' in the

Re: Issues with import from 0.92 into 0.98

2015-05-27 Thread Dave Latham
).toInt'] } Note that you can specify a FORMATTER by column only (cf:qualifier). You cannot specify a FORMATTER for all columns of a column family. On Wed, May 27, 2015 at 10:23 AM, wrote: > On Wed, May 27, 2015, at 11:35 AM, Dave Latham wrote: >> Sounds like quite a puzzle. >> >

Re: Issues with import from 0.92 into 0.98

2015-05-27 Thread Dave Latham
Sounds like quite a puzzle. You mentioned that you can read data written through manual Puts from the shell - but not data from the Import. There must be something different about the data itself once it's in the table. Can you compare a row that was imported to a row that was manually written -

Re: How to Restore the block locality of a RegionServer ?

2015-05-09 Thread Dave Latham
Major compactions will fix locality, so long as there is space on the local data nodes and they actually happen. Also, if there is already only a single HFile in a store, major compaction may be skipped. Newer versions of hbase have a parameter hbase.hstore.min.locality.to.skip.major.compact that

Re: toStringBinary output is painful

2015-04-14 Thread Dave Latham
> I believe toStringBinary does all ascii if input ascii-only. Right, but it will also mix ascii range characters with binary. > Our output was in part shaped by ruby binary String representation. It was > thought useful that you could copy from shell and find in UI, and > vice-versa. I don't thi

toStringBinary output is painful

2015-04-13 Thread Dave Latham
Wish I had started this conversation 5 years ago... When we're using binary data, especially in row keys (and therefore region boundaries) the output created by toStringBinary is very painful to use: - mix of ascii / hex representation is trouble - it's quite long (4 characters per binary byte)
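The mixed rendering being complained about can be mimicked in a few lines: printable ASCII bytes pass through unescaped while everything else becomes a four-character `\xHH` escape, so a single key can interleave both forms. This is an illustrative re-implementation, not HBase's `Bytes.toStringBinary` itself:

```python
def to_string_binary(data: bytes) -> str:
    """Printable ASCII passes through; other bytes become \\xHH
    (4 characters per binary byte). Note a literal backslash byte also
    passes through unescaped, which makes the output ambiguous --
    one of the pain points discussed above. Sketch only."""
    out = []
    for b in data:
        if 32 <= b < 127:
            out.append(chr(b))          # printable ASCII range
        else:
            out.append('\\x%02X' % b)   # everything else as hex escape
    return ''.join(out)
```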

Re: Poll: HBase usage by HBase version

2015-03-18 Thread Dave Latham
If you haven't already seen it - take a look at the bridge at https://issues.apache.org/jira/browse/HBASE-12814 We're using it to go through the process now. Dave On Wed, Mar 18, 2015 at 5:46 PM, Bryan Beaudreault wrote: > My only complaint about this poll is the labels: "0.94.x - I like stable

Re: Different time ranges for different cfs when using TableInputFormat

2015-03-04 Thread Dave Latham
That's not possible with HBase today. The simplest thing may be to set your Scan time range to include both today's and yesterday's data and then filter down to only the data you want inside your map task. Other possibilities would be creating a custom filter to do the filtering on the server sid
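The workaround suggested here -- widen the scan's time range to cover both families and filter per-family inside the map task -- can be sketched as follows. The family names "A" and "B" and the cell shape `(family, qualifier, ts, value)` are assumptions for illustration:

```python
def filter_row(cells, recent_cutoff_ts):
    """Map-side filtering over a scan whose time range spans both days:
    keep every cell of family 'A', but only family 'B' cells at or after
    the cutoff. Sketch only."""
    kept = []
    for family, qualifier, ts, value in cells:
        if family == "A" or ts >= recent_cutoff_ts:
            kept.append((family, qualifier, ts, value))
    return kept
```

The cost of this approach is that the scan still reads all of B's data for the wider range; the server-side filter or coprocessor alternatives mentioned would avoid that I/O.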

Re: [ANNOUNCE] Apache HBase 1.0.0 is now available for download

2015-02-24 Thread Dave Latham
What a milestone! Congratulations to the HBase developer community and everyone who worked to make this happen. HBase has come a long way over the years. On Tue, Feb 24, 2015 at 12:28 AM, Enis Söztutar wrote: > The HBase Team is pleased to announce the immediate release of HBase 1.0.0. > Downl

Re: Does hbase WAL ensures no data loss?

2015-02-16 Thread Dave Latham
Hi Hongbin, The WAL class is used internally to the region server. Typically an HBase write operation will first call WAL.append() with the data, then later, after releasing locks, call WAL.sync() to ensure that the data for that write has been synced to be durable before returning to the client

Re: Migrating data from HBase 0.94 to 0.98

2015-02-10 Thread Dave Latham
There's also a patch at https://issues.apache.org/jira/browse/HBASE-12814 to allow you to run replication between a 0.94 cluster and a 0.98 cluster. With that you can get the data setup in both 0.94 clusters, then upgrade one at a time. Dave On Mon, Feb 9, 2015 at 4:46 AM, Hayden Marchant wrote:

Re: HBase backup, recovery, replication et al

2013-10-17 Thread Dave Latham
"hbase book"? > > We are on CDH4.4 - HBase 0.94.6, so I think we are good there. > > Thanks for your time Dave. > > Harshad > > > On Thu, Oct 17, 2013 at 2:39 PM, Dave Latham wrote: > > > We're running HBase replication successfully on a 500 TB (c

Re: HBase backup, recovery, replication et al

2013-10-17 Thread Dave Latham
We're running HBase replication successfully on a 500 TB (compressed - raw is about 2PB) cluster over a 60ms link across the country. I'd give it a thumbs up for dealing with loss of a cluster and being able to run applications in two places that can tolerate inconsistency from the asynchronous na

Re: Limit number of columns in column family

2013-09-23 Thread Dave Latham
What about having all columns in the column family use the same qualifier and then setting the max versions for that column family to limit it? http://hbase.apache.org/book.html#schema.versions It would only work if you didn't need to do updates to the cell without knowing its timestamp or having

Re: Tables gets Major Compacted even if they haven't changed

2013-09-10 Thread Dave Latham
Major compactions can still be useful to improve locality - could we add a condition to check for that too? On Mon, Sep 9, 2013 at 10:41 PM, lars hofhansl wrote: > Interesting. I guess we could add a check to avoid major compactions if > (1) no TTL is set or we can show that all data is newer a

Re: data loss after cluster wide power loss

2013-07-01 Thread Dave Latham
On Mon, Jul 1, 2013 at 4:52 PM, Azuryy Yu wrote: > how to enable "sync on block close" in HDFS? > Set dfs.datanode.synconclose to true See https://issues.apache.org/jira/browse/HDFS-1539
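The property named in this reply goes in `hdfs-site.xml` on the DataNodes; a minimal fragment:

```xml
<!-- hdfs-site.xml on each DataNode: fsync block files on close (HDFS-1539) -->
<property>
  <name>dfs.datanode.synconclose</name>
  <value>true</value>
</property>
```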

Re: data loss after cluster wide power loss

2013-07-01 Thread Dave Latham
sh Srinivas wrote: > Yes this is a known issue. > > The HDFS part of this was addressed in > https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is not > available in 1.x release. I think HBase does not use this API yet. > > > On Mon, Jul 1, 2013 at 3:00 PM, Da

Re: When replication is stopped, .oldlogs is never cleaned

2013-02-27 Thread Dave Latham
On Tue, Feb 26, 2013 at 4:23 PM, Jean-Daniel Cryans wrote: > Well the rest of the logic is part of the replication code, so > logically I think it needs to be disabled too if you kill replication. > It leaves us with the choice of keeping the logs around or not. If you > think the former is danger

Re: When replication is stopped, .oldlogs is never cleaned

2013-02-26 Thread Dave Latham
me you should never have to stay on stop_replication > more than a few minutes, either you'll continue replicating, you drop > the peer, or you disable that peer. > > FWIW setting hbase.replication to true with no peers should achieve > what you want, no need to call stop_replica

When replication is stopped, .oldlogs is never cleaned

2013-02-26 Thread Dave Latham
We have been preparing to enable replication between two large clusters. For the past couple of weeks, replication has been enabled via hbase-site.xml, but the replication state has been false (set false by issuing a stop_replication command). The master is no longer cleaning any logs from /hbase/

Re: CatalogJanitor: REGIONINFO_QUALIFIER is empty

2013-02-25 Thread Dave Latham
We recently saw some of these warnings in a cluster we were setting up. These warnings mean there are rows in the META table that are missing one of the expected columns. In our case, we verified that these regions didn't appear to exist in HDFS either and the table itself showed no holes or probl

Re: Using doubles and longs as ordering row values

2012-11-05 Thread Dave Latham
This fork looks a bit more up to date: https://github.com/ndimiduk/orderly On Mon, Nov 5, 2012 at 4:26 PM, Dave Latham wrote: > Here's a project to deal with this issue specifically. I'm not sure of > it's status: > https://github.com/conikeec/orderly > > > On

Re: Using doubles and longs as ordering row values

2012-11-05 Thread Dave Latham
Here's a project to deal with this issue specifically. I'm not sure of it's status: https://github.com/conikeec/orderly On Mon, Nov 5, 2012 at 4:01 PM, lars hofhansl wrote: > Have a look at the lily library. It has code to encode Longs/Doubles into > bytes such that resulting bytes sort as expe

Re: scaling a low latency service with HBase

2012-10-22 Thread Dave Latham
On Fri, Oct 19, 2012 at 5:22 PM, Amandeep Khurana wrote: > Answers inline > > On Fri, Oct 19, 2012 at 4:31 PM, Dave Latham wrote: > >> I need to scale an internal service / datastore that is currently hosted on >> an HBase cluster and wanted to ask for advice from

Re: scaling a low latency service with HBase

2012-10-22 Thread Dave Latham
antages when it came to SSDs? > > If your data lookups exhibits temporal locality, external, client side cache > pools may help. > > My 2c, > Abhishek > > > -Original Message- > From: ddlat...@gmail.com [mailto:ddlat...@gmail.com] On Behalf Of Dave Latham > S

Re: ANN: HBase 0.92.0 is available for download

2012-01-23 Thread Dave Latham
Woohoo! Many thanks to everyone who contributed to this big release. One of HBase's biggest strengths is its community. Stack, the link to the upgrade guide doesn't seem to be working, and I don't see any information on the page about upgrading to 0.92. Dave On Mon, Jan 23, 2012 at 3:57 PM, St

Re: corrupt WAL and Java Heap Space...

2011-08-26 Thread Dave Latham
We just hit the same issue. I attached log snippets from the regionserver and master into https://issues.apache.org/jira/browse/HBASE-4107 I was able to get the log file out of hdfs. Is there a location I can put it back in to have it picked up? Dave On Fri, Jul 15, 2011 at 12:23 PM, Andy Saut

Re: How to efficiently join HBase tables?

2011-06-08 Thread Dave Latham
I believe this is what Eran is suggesting: Table A --- Row1 (has joinVal_1) Row2 (has joinVal_2) Row3 (has joinVal_1) Table B --- Row4 (has joinVal_1) Row5 (has joinVal_3) Row6 (has joinVal_2) Mapper receives a list of input rows (union of both input tables in any order), and produces (=
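Eran's scheme is a classic reduce-side join: the mapper emits each row keyed by its join value, and the reducer groups rows from both tables that share a key. A sketch using the example rows above (the generator/dict structure stands in for the MapReduce shuffle; row shape `(table, row_id, join_val)` is an assumption):

```python
from collections import defaultdict

def map_phase(rows):
    """Each input row (from either table) is emitted keyed by its join value."""
    for table, row_id, join_val in rows:
        yield join_val, (table, row_id)

def reduce_phase(pairs):
    """Group by join value; each group holds the matching rows from both
    tables, ready to be joined. Sketch of the shuffle + reduce step."""
    groups = defaultdict(list)
    for join_val, row in pairs:
        groups[join_val].append(row)
    return dict(groups)
```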

Re: Log4j changes not working inside static mapper and reducer classes

2011-05-25 Thread Dave Latham
I'd recommend adding -Dlog4j.debug to the JVM args for any JVM that's not giving you what you expect. In this case, if it's the map/reduce tasks, add it to mapred.child.java.opts in mapred-site.xml. It should show you what configuration log4j is actually picking up. Dave On Wed, May 25, 2011 at

Re: Mapreduce log question

2011-05-25 Thread Dave Latham
Are you using TableInputFormat? If so, if you turn on DEBUG level logging for hbase (or just org.apache.hadoop.hbase.mapreduce.TableInputFormatBase) you should see lines like this, giving the map task number, region location, start row, and end row: getSplits: split -> 0 -> hslave107:,@G\xA0\xFB\

Re: 0.90.3

2011-05-24 Thread Dave Latham
Are you using the graceful_stop script? In 0.90.3 the bin/graceful_stop.sh script was updated to disable the master's balancer. However, it doesn't seem that anything re-enables it, so if you're using it you need to re-enable it on your own. See the book for more details: http://hbase.apache.org

rollback to 0.20

2011-04-27 Thread Dave Latham
The HBase book ( http://hbase.apache.org/book/upgrading.html ) states, > This version of 0.90.x HBase can be started on data written by HBase 0.20.x > or HBase 0.89.x. There is no > need of a migration step. HBase 0.89.x and 0.90.x does write out the name of > region directories differently -- >

Re: intersection of row ids

2011-03-11 Thread Dave Latham
If the ordering of the row ids is the same in both tables and both are of the same order of magnitude of size, I would recommend opening scanners on both tables, then compare the current row in each scanner, and advance whichever scanner is behind. Whenever you hit a match, you output it and advan
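The two-scanner merge described here can be sketched with plain iterators standing in for the HBase scanners (an assumption for illustration; both must yield row ids in the same sorted order):

```python
def intersect_sorted(scan_a, scan_b):
    """Merge-step intersection of two sorted streams of row ids:
    compare the current row from each, advance whichever is behind,
    and emit whenever the two match. Sketch only."""
    matches = []
    a, b = iter(scan_a), iter(scan_b)
    row_a, row_b = next(a, None), next(b, None)
    while row_a is not None and row_b is not None:
        if row_a == row_b:
            matches.append(row_a)
            row_a, row_b = next(a, None), next(b, None)
        elif row_a < row_b:
            row_a = next(a, None)   # scanner A is behind; advance it
        else:
            row_b = next(b, None)   # scanner B is behind; advance it
    return matches
```

Each stream is consumed exactly once, so the cost is one pass over each table rather than a lookup per row.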