How to set Timeout for get/scan operations without impacting others
Hi,

I need to set a tight timeout for get/scan operations, and I think the HBase client already supports it. I found three related keys:

- hbase.client.operation.timeout
- hbase.rpc.timeout
- hbase.client.retries.number

What's the difference between hbase.client.operation.timeout and hbase.rpc.timeout? My understanding is that hbase.rpc.timeout has a larger scope than hbase.client.operation.timeout, so setting hbase.client.operation.timeout is safer. Am I correct? And are there any other property keys I can use?

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
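For reference, a minimal sketch of how the three keys from the question could be set in hbase-site.xml, or programmatically on a dedicated client Configuration so that only this client's gets/scans are affected. The millisecond values below are purely illustrative, not recommendations; hbase.rpc.timeout bounds a single RPC call, while hbase.client.operation.timeout bounds the whole client operation across retries:

```xml
<!-- hbase-site.xml fragment; values are only illustrative -->
<property>
  <name>hbase.rpc.timeout</name>
  <value>2000</value>   <!-- upper bound, in ms, for one RPC call -->
</property>
<property>
  <name>hbase.client.operation.timeout</name>
  <value>6000</value>   <!-- upper bound, in ms, for the whole operation, retries included -->
</property>
<property>
  <name>hbase.client.retries.number</name>
  <value>3</value>      <!-- how many times the client retries before giving up -->
</property>
```

Setting these on the Configuration object used to create the table/connection keeps the tight timeouts scoped to that client without affecting others.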
Re: Need Help: RegionTooBusyException: Above memstore limit
The error disappeared after changing the write buffer from 20MB to 2MB. Thanks for the help!

Jianshi

On Wed, Mar 4, 2015 at 12:12 AM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:

> It depends on how you manage your connection, your table and your puts. If
> it works for you with reducing the batch buffer size, then just keep it the
> way it is...
>
> JM
Re: Need Help: RegionTooBusyException: Above memstore limit
Yes, it looks like reducing the batch buffer size works (still validating).

But why is setAutoFlush(false) harmful here? I just want maximum write speed.

Jianshi

On Tue, Mar 3, 2015 at 10:54 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:

> Let HBase manage the flushes for you. Remove edgeTable.setAutoFlush(false)
> and maybe reduce your batch size.
>
> I don't think that increasing the memstore is the right way to go. It sounds
> more like a plaster on the issue than a good fix (to me).
>
> JM
>
> 2015-03-03 9:43 GMT-05:00 Ted Yu:
>
> > The default value for hbase.regionserver.global.memstore.size is 0.4.
> >
> > Meaning the maximum size of all memstores in the region server before new
> > updates are blocked and flushes are forced is 7352m (0.4 × 18380m).
> >
> > You can increase the value for hbase.regionserver.global.memstore.size.
> >
> > Please also see if you can distribute the writes to the underlying region
> > so that the region's use of memstore comes down.
> >
> > Cheers
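The 7352m figure quoted in this thread follows directly from the default hbase.regionserver.global.memstore.size and the region servers' heap; a quick sketch of the arithmetic (heap size taken from earlier in the thread):

```java
public class MemstoreLimit {
    // Global memstore limit in MB: once the combined memstores on a region
    // server exceed this, new updates are blocked and flushes are forced.
    static long limitMb(long heapMb, double globalMemstoreSize) {
        return Math.round(heapMb * globalMemstoreSize);
    }

    public static void main(String[] args) {
        long heapMb = 18380;    // region server max heap, from this thread
        double fraction = 0.4;  // default hbase.regionserver.global.memstore.size
        System.out.println(limitMb(heapMb, fraction) + "m"); // prints "7352m"
    }
}
```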
Re: Need Help: RegionTooBusyException: Above memstore limit
Hi JM,

Thanks for the hints. Here are my settings for the writer:

edgeTable.setAutoFlush(false)
edgeTable.setWriteBufferSize(20971520)

The write buffer seems quite large, as the region server is hosting 12 related regions I'm writing to. I'll test with a smaller write buffer size. The size of each put is between 10k~100k.

Jianshi

On Mon, Mar 2, 2015 at 11:04 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:

> Hi Jianshi,
>
> Are you doing batches of puts? If so, what's the size of the batch and what's
> the size of the puts? Are you trying to send a batch which in the end will
> be bigger than the memstore size for a single RS? Can you try to reduce the
> size of this batch?
>
> JM
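With the numbers in this message, the 20MB client buffer accumulates a lot of puts before each automatic flush hits the region server at once; a rough sketch of the range (buffer and put sizes from the thread):

```java
public class WriteBufferMath {
    // Rough number of puts the client buffers before an automatic flush,
    // given a write buffer size and an average put size (both in bytes).
    static long putsPerFlush(long bufferBytes, long putBytes) {
        return bufferBytes / putBytes;
    }

    public static void main(String[] args) {
        long buffer = 20971520; // 20MB, from setWriteBufferSize above
        // Puts are 10k~100k in this thread, so one flush delivers
        // roughly 200 to 2000 puts' worth of data in a single burst.
        System.out.println(putsPerFlush(buffer, 100 * 1024) + " ~ "
                         + putsPerFlush(buffer, 10 * 1024) + " puts per flush");
    }
}
```

This is why shrinking the buffer (or letting HBase manage flushes) relieves pressure on a single hot region.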
Re: Need Help: RegionTooBusyException: Above memstore limit
Hi Ted,

Only one region server is problematic.

hbase.regionserver.global.memstore.size is not set; the problematic region is using 774m for its memstore.

Max heap is 18380m for all region servers.

Jianshi

On Mon, Mar 2, 2015 at 10:59 PM, Ted Yu wrote:

> What's the value for hbase.regionserver.global.memstore.size?
>
> Did RegionTooBusyException happen to many regions or only a few regions?
>
> How much heap did you give the region servers?
>
> bq. HBase version is 0.98.0.2.1.2.0-402
>
> Yeah, this is a bit old. Please consider upgrading.
>
> Cheers
Need Help: RegionTooBusyException: Above memstore limit
Hi,

I'm constantly hitting "RegionTooBusyException: Above memstore limit" errors on one region server when writing data to HBase.

I checked the region server log, and I've seen a lot of warnings during the data writes:

WARN wal.FSHLog: couldn't find oldest seqNum for the region we're about to flush, ...

HBase then seems to flush the data and add it as an HStore file.

I also get a few warnings in client.ShortCircuitCache saying "could not load ... due to InvalidToken exceptions".

Can anyone give me a hint about what went wrong?

My HBase version is 0.98.0.2.1.2.0-402 (I'm using HDP 2.1), so the release is a little bit old.

Thanks,

Jianshi
Re: setCompactionEnabled(false) seems ignored by HBase (0.98)
Thanks Stack, I will check the logs for the reason.

I'm only disabling compaction during dynamic splits (~10 mins), so it's acceptable in my case.

Thanks,
Jianshi

On Wed, Jan 7, 2015 at 1:37 AM, Stack wrote:

> On Mon, Jan 5, 2015 at 11:00 PM, Jianshi Huang wrote:
>
> > Hi,
> >
> > Firstly, I found it strange that when I added a new split to a table and
> > did admin.move, it triggered a MAJOR compaction for the whole table.
>
> Usually a compaction says what provoked it in the log, and why it is a major
> compaction.
>
> Splits and moves are not hooked up to force a major compaction, so check the
> logs to see what brought on the compaction. Rather than wholesale disabling
> compactions -- probably a bad idea -- you are probably better off trying to
> tune what triggers compactions in your workload.
>
> St.Ack
Re: setCompactionEnabled(false) seems ignored by HBase (0.98)
Ah! I need to run

admin.modifyTable(tableNameBytes, tableDescriptor)

Will try it soon...

Jianshi

On Tue, Jan 6, 2015 at 11:12 PM, Ted Yu wrote:

> This is what setCompactionEnabled() does:
>
> public HTableDescriptor setCompactionEnabled(final boolean isEnable) {
>   setValue(COMPACTION_ENABLED_KEY, isEnable ? TRUE : FALSE);
>   return this;
> }
>
> FYI
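Putting the two halves of the thread together, a minimal sketch of the full sequence, assuming the 0.98-era HBaseAdmin/HTableDescriptor API quoted above (setCompactionEnabled() only mutates the local descriptor; modifyTable() is what actually pushes it to the cluster):

```java
// Sketch only -- assumes the 0.98 client API discussed in this thread;
// conf and tableNameBytes are the Configuration and table name from above.
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = admin.getTableDescriptor(tableNameBytes);

desc.setCompactionEnabled(false);         // changes the in-memory descriptor only
admin.modifyTable(tableNameBytes, desc);  // pushes the change to the cluster

// ... add splits / move regions here ...

desc.setCompactionEnabled(true);          // re-enable afterwards
admin.modifyTable(tableNameBytes, desc);
```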
setCompactionEnabled(false) seems ignored by HBase (0.98)
Hi,

Firstly, I found it strange that when I added a new split to a table and did admin.move, it triggered a MAJOR compaction for the whole table.

So I tried to disable compaction before adding splits:

admin.getTableDescriptor(tableNameBytes).setCompactionEnabled(false)

However, MAJOR compaction is still triggered; it looks like the flag is ignored by HBase. Do I need to (have to) disable the table first?

Cheers,
Jianshi
Re: Storing JSON in HBase value cell, which serialization format is most compact?
Oh, that article, I've read it before. I'm using the approach of a single KV holding all my columns (mostly read-only).

So, conclusion: the saving in disk space is not that huge.

one HBase column per JSON column:  1,350,483  1000  SNAPPY  DIFF
one HBase column for all columns:  1,119,330  1000  SNAPPY  DIFF

That's only about 17%. However, the article suggests that the saving over the network wire is huge:

6,293,670  1000  NONE  NONE
vs
1,374,465  1000  NONE  NONE

Thanks again for the help!

Jianshi

On Fri, Nov 14, 2014 at 12:12 PM, Ted Yu wrote:

> w.r.t. the effect of data block encoding on HFile size, take a look at Doug
> Meil's blog 'The Effect of ColumnFamily, RowKey and KeyValue Design on
> HFile Size':
> http://blogs.apache.org/hbase/
>
> Cheers
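The percentages can be checked directly from the sizes reported in this thread (a quick sketch; sizes in bytes as given above):

```java
public class SizeSavings {
    // Percentage saved by the second layout relative to the first,
    // rounded to the nearest whole percent.
    static long savedPercent(long before, long after) {
        return Math.round(100.0 * (before - after) / before);
    }

    public static void main(String[] args) {
        // On disk (SNAPPY + DIFF): one column per field vs. one column for all
        System.out.println(savedPercent(1350483, 1119330) + "%"); // prints "17%"
        // Over the wire (no compression, no encoding)
        System.out.println(savedPercent(6293670, 1374465) + "%"); // prints "78%"
    }
}
```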
Re: Storing JSON in HBase value cell, which serialization format is most compact?
But HDP 2.2 uses HDFS 2.6.0... it will be very hard to convince our admins to upgrade.

Would you recommend we upgrade to 2.6.0? I'll ask them to consult HWX if you say yes. :)

Jianshi

On Fri, Nov 14, 2014 at 9:42 AM, Ted Yu wrote:

> No.
> The upcoming HDP 2.2 does have that fix.
>
> Cheers
Re: Storing JSON in HBase value cell, which serialization format is most compact?
Oh, by the way, does the latest HDP 2.1 (0.98.0.2.1.7.0-784-hadoop2) have this fix?

Jianshi
Re: Storing JSON in HBase value cell, which serialization format is most compact?
Thanks Ted.

I think the fix you mentioned is HBASE-12078 <https://issues.apache.org/jira/browse/HBASE-12078>.

Not sure when our Hadoop admins would upgrade it, ahhh.

Jianshi

On Thu, Nov 13, 2014 at 11:15 PM, Ted Yu wrote:

> Keep in mind that Prefix Tree encoding has higher overhead in the write path
> compared to other data block encoding methods.
>
> Please use 0.98.7, which has the latest fixes for Prefix Tree encoding.
>
> Cheers
Re: Storing JSON in HBase value cell, which serialization format is most compact?
Thanks Ram,

How about Prefix Tree based encoding then? HBASE-4676 <https://issues.apache.org/jira/browse/HBASE-4676> says it's also possible to do suffix tries. Then it could be a nice fit for a JSON String (or any long value where changes are small).

Maybe I should just flatten the JSON to columns, hmm... what's the overhead of a column?

Jianshi

On Thu, Nov 13, 2014 at 4:49 PM, ramkrishna vasudevan <ramkrishna.s.vasude...@gmail.com> wrote:

> >> So is it possible to specify FASTDIFF for rowkey/column and DIFF for value cell?
>
> No, that is not possible now. All the encoding is per KV only.
> But what you say is definitely worth trying.
>
> >> So would you recommend storing JSON flattened as many columns?
>
> Maybe yes. But I have practically not used JSON formats, so I may not be
> the best person to comment on this.
>
> Regards
> Ram
Re: Storing JSON in HBase value cell, which serialization format is most compact?
Thanks Ram, So is it possible to specify FASTDIFF for rowkey/column and DIFF for value cell? So would you recommend storing JSON flattened as many columns? Jianshi On Thu, Nov 13, 2014 at 2:08 PM, ramkrishna vasudevan < ramkrishna.s.vasude...@gmail.com> wrote: > Hi > > >> Since I'm storing > historical data (snapshot data) and changes between adjacent value cells > are relatively small. > > If the values are changing even if it is smaller the FASTDIFF will rewrite > the value part. Only if there are exact matches then it would skip the > value part. JFYI. > > Regards > Ram > > On Thu, Nov 13, 2014 at 11:23 AM, Jianshi Huang > wrote: > > > I thought FASTDIFF was only for rowkey and columns, great if it also > works > > in value cell. > > > > And thanks for the bjson link! > > > > Jianshi > > > > On Thu, Nov 13, 2014 at 1:18 PM, Ted Yu wrote: > > > > > There is FASTDIFF data block encoding. > > > > > > See also http://bjson.org/ > > > > > > Cheers > > > > > > On Nov 12, 2014, at 9:08 PM, Jianshi Huang > > > wrote: > > > > > > > Hi, > > > > > > > > I'm currently saving JSON in pure String format in the value cell and > > > > depends on HBase' block compression to reduce the overhead of JSON. > > > > > > > > I'm wondering if there's a more space efficient way to store JSON? > > > > (there're lots of 0s and 1s, JSON String actually is an OK format) > > > > > > > > I want to keep the value as a Map since the schema of source data > might > > > > change over time. > > > > > > > > Also is there a DIFF based encoding for values? Since I'm storing > > > > historical data (snapshot data) and changes between adjacent value > > cells > > > > are relatively small. 
> > > > > > > > > > > > Thanks, > > > > -- > > > > Jianshi Huang > > > > > > > > LinkedIn: jianshi > > > > Twitter: @jshuang > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: Storing JSON in HBase value cell, which serialization format is most compact?
I thought FASTDIFF was only for rowkey and columns, great if it also works in value cell. And thanks for the bjson link! Jianshi On Thu, Nov 13, 2014 at 1:18 PM, Ted Yu wrote: > There is FASTDIFF data block encoding. > > See also http://bjson.org/ > > Cheers > > On Nov 12, 2014, at 9:08 PM, Jianshi Huang > wrote: > > > Hi, > > > > I'm currently saving JSON in pure String format in the value cell and > > depends on HBase' block compression to reduce the overhead of JSON. > > > > I'm wondering if there's a more space efficient way to store JSON? > > (there're lots of 0s and 1s, JSON String actually is an OK format) > > > > I want to keep the value as a Map since the schema of source data might > > change over time. > > > > Also is there a DIFF based encoding for values? Since I'm storing > > historical data (snapshot data) and changes between adjacent value cells > > are relatively small. > > > > > > Thanks, > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Storing JSON in HBase value cell, which serialization format is most compact?
Hi, I'm currently saving JSON as a plain String in the value cell and depend on HBase's block compression to reduce the JSON overhead. I'm wondering if there's a more space-efficient way to store JSON? (There are lots of 0s and 1s; a JSON String is actually an OK format.) I want to keep the value as a Map, since the schema of the source data might change over time. Also, is there a DIFF-based encoding for values? I'm storing historical data (snapshot data), and changes between adjacent value cells are relatively small. Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
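For what it's worth, the FAST_DIFF encoding discussed in this thread is enabled per column family. A minimal sketch from the HBase shell (the table and family names here are made up; note that FAST_DIFF diffs whole cells within a data block, it is not a per-value JSON diff):

```
alter 'mytable', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF'}
```

Block compression (e.g. SNAPPY) can be set on the same family independently of the encoding.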
Is there a TableInputFormat implementation that supports multiple splits for each region
It seems each region maps to exactly one split in the current TableInputFormat. We have large regions, so this is suboptimal. Is there a TableInputFormat implementation that supports multiple splits per region? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
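I'm not aware of a stock TableInputFormat that does this out of the box; a subclass overriding getSplits() could cut each region's range itself. Below is a sketch of just the key arithmetic such an override would need (the class and helper names are my own, not HBase API):

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Sketch only -- shows how a region's [startKey, endKey) range can be cut into
// n lexicographic sub-ranges by treating fixed-width keys as unsigned integers.
public class RangeSplitter {

    // Left-pad a key with zero bytes so all keys compare at the same width.
    static byte[] pad(byte[] b, int len) {
        byte[] out = new byte[len];
        System.arraycopy(b, 0, out, len - b.length, b.length);
        return out;
    }

    // Returns the n + 1 boundary keys: start, start + step, ..., start + n * step.
    static List<byte[]> splitRange(byte[] start, byte[] end, int n) {
        int len = Math.max(start.length, end.length);
        BigInteger s = new BigInteger(1, pad(start, len));
        BigInteger e = new BigInteger(1, pad(end, len));
        BigInteger step = e.subtract(s).divide(BigInteger.valueOf(n));
        List<byte[]> boundaries = new ArrayList<>();
        for (int i = 0; i <= n; i++) {
            byte[] raw = s.add(step.multiply(BigInteger.valueOf(i))).toByteArray();
            // toByteArray() may carry leading zero (sign) bytes; strip, then re-pad.
            int skip = 0;
            while (skip < raw.length - 1 && raw[skip] == 0) skip++;
            byte[] trimmed = new byte[raw.length - skip];
            System.arraycopy(raw, skip, trimmed, 0, trimmed.length);
            boundaries.add(pad(trimmed, len));
        }
        return boundaries;
    }
}
```

Each adjacent boundary pair would then become one input split for that region, all pointing at the same region server location.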
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Thanks for the explanation Qian! I think being able to balance empty regions is important and the preferred result to me. The best way so far is to manually 'balance' the regions if we need to add pre-splits dynamically. Jianshi On Tue, Sep 23, 2014 at 11:35 AM, Qiang Tian wrote: > Hello, I happened to got balancer related issues 2 months ago and looked at > that part, below is a summary: > 1)by default, hbase balancer(StochasticLoadBalancer by default) does not > balance regions per table. i.e. all regions are considered as 1 table. so > if you have many tables, especially some tables have empty regions, you > probably get unbalanced, the balancer probably not triggered at all. > this is got from code inspection, my problem failed to be reproduced later. > but it proved that deleting empty regions can trigger balancer correctly > and make regions well balanced. > > 2)there are some other reasons that balancer are not triggered. see > HMaster#balance. turn on debug can see related messages in master log. in > my case, it is not triggered because there are regions in transition: > LOG.debug("Not running balancer because " + regionsInTransition.size() + > " region(s) in transition: " + > org.apache.commons.lang.StringUtils. > abbreviate(regionsInTransition.toString(), 256)); > > the cause can be found in regionserver log file. > > 3)per-table balance can be set by "hbase.master.loadbalance.bytable", > however it looks not a good option when you have many tables - the master > will issue balance call for each table, one by one. > > 4)split region follows normal balancer process. so if you have issue in #1, > split does not help balance. it looks pre-split at table creation is fine, > which uses round-robin assignment. > > > > On Tue, Sep 23, 2014 at 2:12 AM, Bharath Vissapragada < > bhara...@cloudera.com > > wrote: > > > https://issues.apache.org/jira/browse/HBASE-11368 related to the > original > > issue too. 
> > > > On Mon, Sep 22, 2014 at 10:18 AM, Ted Yu wrote: > > > > > As you noted in the FIXME, there're some factors which should be > tackled > > by > > > balancer / assignment manager. > > > > > > Please continue digging up master log so that we can find the cause for > > > balancer not fulfilling your goal. > > > > > > Cheers > > > > > > On Mon, Sep 22, 2014 at 10:09 AM, Jianshi Huang < > jianshi.hu...@gmail.com > > > > > > wrote: > > > > > > > Ok, I fixed this by manually reassign region servers to newly created > > > ones. > > > > > > > > def reassignRegionServer(admin: HBaseAdmin, regions: > > Seq[HRegionInfo], > > > > regionServers: Seq[ServerName]): Unit = { > > > > val rand = new Random() > > > > regions.foreach { r => > > > > val idx = rand.nextInt(regionServers.size) > > > > val server = regionServers(idx) > > > > // FIXME: what if selected region server is dead? > > > > admin.move(r.getEncodedNameAsBytes, > > > > server.getServerName.getBytes("UTF8")) > > > > } > > > > } > > > > > > > > er... > > > > > > > > Jianshi > > > > > > > > On Tue, Sep 23, 2014 at 12:24 AM, Jianshi Huang < > > jianshi.hu...@gmail.com > > > > > > > > wrote: > > > > > > > > > Hmm...any workaround? I only want to do this: > > > > > > > > > > Rebalance the new regions *evenly* to all servers after manually > > adding > > > > > splits, so later bulk insertions won't cause contention. > > > > > > > > > > P.S. > > > > > Looks like two of the region servers which had majority of the > > regions > > > > > were down during Major compaction... I guess it had too much data. > > > > > > > > > > > > > > > Jianshi > > > > > > > > > > On Tue, Sep 23, 2014 at 12:13 AM, Jianshi Huang < > > > jianshi.hu...@gmail.com > > > > > > > > > > wrote: > > > > > > > > > >> Yes, I have access to Master UI, however logs/*.log cannot be > opened > > > or > > > > >> downloaded, must be some security restrictions in the proxy... > > > > >> > > > > >> Jianshi > > > > >> > > > > >> On Tue, Sep 23,
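Point 3) of the quoted summary refers to a master-side setting; as a sketch (worth verifying the property name against your HBase version), per-table balancing would be enabled in hbase-site.xml:

```xml
<property>
  <!-- Balance each table's regions independently; see the caveat quoted above
       about one balance call per table when there are many tables. -->
  <name>hbase.master.loadbalance.bytable</name>
  <value>true</value>
</property>
```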
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Ok, I fixed this by manually reassigning regions to the newly created region servers.

def reassignRegionServer(admin: HBaseAdmin, regions: Seq[HRegionInfo], regionServers: Seq[ServerName]): Unit = {
  val rand = new Random()
  regions.foreach { r =>
    val idx = rand.nextInt(regionServers.size)
    val server = regionServers(idx)
    // FIXME: what if selected region server is dead?
    admin.move(r.getEncodedNameAsBytes, server.getServerName.getBytes("UTF8"))
  }
}

er... Jianshi On Tue, Sep 23, 2014 at 12:24 AM, Jianshi Huang wrote: > Hmm...any workaround? I only want to do this: > > Rebalance the new regions *evenly* to all servers after manually adding > splits, so later bulk insertions won't cause contention. > > P.S. > Looks like two of the region servers which had majority of the regions > were down during Major compaction... I guess it had too much data. > > > Jianshi > > On Tue, Sep 23, 2014 at 12:13 AM, Jianshi Huang > wrote: > >> Yes, I have access to Master UI, however logs/*.log cannot be opened or >> downloaded, must be some security restrictions in the proxy... >> >> Jianshi >> >> On Tue, Sep 23, 2014 at 12:06 AM, Ted Yu wrote: >> >>> Do you have access to Master UI ? >>> >>> :60010/logs/ would show you list of log files. >>> >>> The you can view :60010/logs/hbase--master-XXX.log >>> >>> Cheers >>> >>> On Mon, Sep 22, 2014 at 9:00 AM, Jianshi Huang >>> wrote: >>> >>> > Ah... I don't have access to HMaster logs... I need to ask the admin. >>> > >>> > Jianshi >>> > >>> > On Mon, Sep 22, 2014 at 11:49 PM, Ted Yu wrote: >>> > >>> > > bq. assign per-table balancer class >>> > > >>> > > No that I know of. >>> > > Can you pastebin master log involving output from balancer ? >>> > > >>> > > Cheers >>> > > >>> > > On Mon, Sep 22, 2014 at 8:29 AM, Jianshi Huang < >>> jianshi.hu...@gmail.com> >>> > > wrote: >>> > > >>> > > > Hi Ted, >>> > > > >>> > > > I moved setBalancerRunning before balancer and run them twice. >>> However >>> > I >>> > > > still got highly skewed region distribution.
>>> > > > >>> > > > I guess it's because of the StochasticLoadBalancer, can I assign >>> > > per-table >>> > > > balancer class in HBase? >>> > > > >>> > > > >>> > > > Jianshi >>> > > > >>> > > > On Mon, Sep 22, 2014 at 9:50 PM, Ted Yu >>> wrote: >>> > > > >>> > > > > admin.setBalancerRunning() call should precede the call to >>> > > > > admin.balancer(). >>> > > > > >>> > > > > You can inspect master log to see whether regions are being >>> moved off >>> > > the >>> > > > > heavily loaded server. >>> > > > > >>> > > > > Cheers >>> > > > > >>> > > > > On Mon, Sep 22, 2014 at 1:42 AM, Jianshi Huang < >>> > > jianshi.hu...@gmail.com> >>> > > > > wrote: >>> > > > > >>> > > > > > Hi Ted and others, >>> > > > > > >>> > > > > > I did the following after adding splits (without data) to my >>> table, >>> > > > > however >>> > > > > > the region is still very imbalanced (one region server has 221 >>> > > regions >>> > > > > and >>> > > > > > other 50 region servers have about 4~8 regions each). >>> > > > > > >>> > > > > > admin.balancer() >>> > > > > > admin.setBalancerRunning(true, true) >>> > > > > > >>> > > > > > The balancer class in my HBase cluster is >>> > > > > > >>> > > > > > org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer >>> > > > > > >>> > > > > > So, is this behavior expected? Can I assign different balancer >>> > class >>> > > to >>> > > > > my >>> > > > > > tables (I don't h
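Random server choice, as in the snippet quoted above, can still leave the distribution skewed; a round-robin pairing bounds the difference between any two servers at one region. A sketch with a hypothetical helper (with a real HBaseAdmin, each resulting pair would still be applied via admin.move as in the original code):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch (hypothetical helper, not an HBase API): pair regions with servers
// round-robin so every server ends up within one region of every other server.
public class RoundRobinAssign {
    static <R, S> List<Map.Entry<R, S>> assign(List<R> regions, List<S> servers) {
        List<Map.Entry<R, S>> pairs = new ArrayList<>();
        for (int i = 0; i < regions.size(); i++) {
            // Region i goes to server i mod |servers|, cycling through the list.
            pairs.add(new SimpleEntry<>(regions.get(i), servers.get(i % servers.size())));
        }
        return pairs;
    }
}
```

The FIXME in the original snippet still applies: a pre-fetched server list can go stale, so a production version should refresh it or handle the move failing.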
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Hmm...any workaround? I only want to do this: Rebalance the new regions *evenly* to all servers after manually adding splits, so later bulk insertions won't cause contention. P.S. Looks like two of the region servers which had majority of the regions were down during Major compaction... I guess it had too much data. Jianshi On Tue, Sep 23, 2014 at 12:13 AM, Jianshi Huang wrote: > Yes, I have access to Master UI, however logs/*.log cannot be opened or > downloaded, must be some security restrictions in the proxy... > > Jianshi > > On Tue, Sep 23, 2014 at 12:06 AM, Ted Yu wrote: > >> Do you have access to Master UI ? >> >> :60010/logs/ would show you list of log files. >> >> The you can view :60010/logs/hbase--master-XXX.log >> >> Cheers >> >> On Mon, Sep 22, 2014 at 9:00 AM, Jianshi Huang >> wrote: >> >> > Ah... I don't have access to HMaster logs... I need to ask the admin. >> > >> > Jianshi >> > >> > On Mon, Sep 22, 2014 at 11:49 PM, Ted Yu wrote: >> > >> > > bq. assign per-table balancer class >> > > >> > > No that I know of. >> > > Can you pastebin master log involving output from balancer ? >> > > >> > > Cheers >> > > >> > > On Mon, Sep 22, 2014 at 8:29 AM, Jianshi Huang < >> jianshi.hu...@gmail.com> >> > > wrote: >> > > >> > > > Hi Ted, >> > > > >> > > > I moved setBalancerRunning before balancer and run them twice. >> However >> > I >> > > > still got highly skewed region distribution. >> > > > >> > > > I guess it's because of the StochasticLoadBalancer, can I assign >> > > per-table >> > > > balancer class in HBase? >> > > > >> > > > >> > > > Jianshi >> > > > >> > > > On Mon, Sep 22, 2014 at 9:50 PM, Ted Yu >> wrote: >> > > > >> > > > > admin.setBalancerRunning() call should precede the call to >> > > > > admin.balancer(). >> > > > > >> > > > > You can inspect master log to see whether regions are being moved >> off >> > > the >> > > > > heavily loaded server. 
>> > > > > >> > > > > Cheers >> > > > > >> > > > > On Mon, Sep 22, 2014 at 1:42 AM, Jianshi Huang < >> > > jianshi.hu...@gmail.com> >> > > > > wrote: >> > > > > >> > > > > > Hi Ted and others, >> > > > > > >> > > > > > I did the following after adding splits (without data) to my >> table, >> > > > > however >> > > > > > the region is still very imbalanced (one region server has 221 >> > > regions >> > > > > and >> > > > > > other 50 region servers have about 4~8 regions each). >> > > > > > >> > > > > > admin.balancer() >> > > > > > admin.setBalancerRunning(true, true) >> > > > > > >> > > > > > The balancer class in my HBase cluster is >> > > > > > >> > > > > > org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer >> > > > > > >> > > > > > So, is this behavior expected? Can I assign different balancer >> > class >> > > to >> > > > > my >> > > > > > tables (I don't have HBase admin permission)? Which one should I >> > use? >> > > > > > >> > > > > > I just want HBase to evenly distribute the regions even they >> don't >> > > have >> > > > > > data (that's the purpose of pre-split I think). >> > > > > > >> > > > > > >> > > > > > Jianshi >> > > > > > >> > > > > > >> > > > > > On Sat, Sep 6, 2014 at 12:45 AM, Ted Yu >> > wrote: >> > > > > > >> > > > > > > Yes. See the following method in HBaseAdmin: >> > > > > > > >> > > > > > > public boolean balancer() >> > > > > > > >> > > > > > > >> > > > > > > On Fri, Sep 5, 2014 at 9:38 AM, Jianshi Huang < >&g
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Yes, I have access to Master UI, however logs/*.log cannot be opened or downloaded, must be some security restrictions in the proxy... Jianshi On Tue, Sep 23, 2014 at 12:06 AM, Ted Yu wrote: > Do you have access to Master UI ? > > :60010/logs/ would show you list of log files. > > The you can view :60010/logs/hbase--master-XXX.log > > Cheers > > On Mon, Sep 22, 2014 at 9:00 AM, Jianshi Huang > wrote: > > > Ah... I don't have access to HMaster logs... I need to ask the admin. > > > > Jianshi > > > > On Mon, Sep 22, 2014 at 11:49 PM, Ted Yu wrote: > > > > > bq. assign per-table balancer class > > > > > > No that I know of. > > > Can you pastebin master log involving output from balancer ? > > > > > > Cheers > > > > > > On Mon, Sep 22, 2014 at 8:29 AM, Jianshi Huang < > jianshi.hu...@gmail.com> > > > wrote: > > > > > > > Hi Ted, > > > > > > > > I moved setBalancerRunning before balancer and run them twice. > However > > I > > > > still got highly skewed region distribution. > > > > > > > > I guess it's because of the StochasticLoadBalancer, can I assign > > > per-table > > > > balancer class in HBase? > > > > > > > > > > > > Jianshi > > > > > > > > On Mon, Sep 22, 2014 at 9:50 PM, Ted Yu wrote: > > > > > > > > > admin.setBalancerRunning() call should precede the call to > > > > > admin.balancer(). > > > > > > > > > > You can inspect master log to see whether regions are being moved > off > > > the > > > > > heavily loaded server. > > > > > > > > > > Cheers > > > > > > > > > > On Mon, Sep 22, 2014 at 1:42 AM, Jianshi Huang < > > > jianshi.hu...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hi Ted and others, > > > > > > > > > > > > I did the following after adding splits (without data) to my > table, > > > > > however > > > > > > the region is still very imbalanced (one region server has 221 > > > regions > > > > > and > > > > > > other 50 region servers have about 4~8 regions each). 
> > > > > > > > > > > > admin.balancer() > > > > > > admin.setBalancerRunning(true, true) > > > > > > > > > > > > The balancer class in my HBase cluster is > > > > > > > > > > > > org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer > > > > > > > > > > > > So, is this behavior expected? Can I assign different balancer > > class > > > to > > > > > my > > > > > > tables (I don't have HBase admin permission)? Which one should I > > use? > > > > > > > > > > > > I just want HBase to evenly distribute the regions even they > don't > > > have > > > > > > data (that's the purpose of pre-split I think). > > > > > > > > > > > > > > > > > > Jianshi > > > > > > > > > > > > > > > > > > On Sat, Sep 6, 2014 at 12:45 AM, Ted Yu > > wrote: > > > > > > > > > > > > > Yes. See the following method in HBaseAdmin: > > > > > > > > > > > > > > public boolean balancer() > > > > > > > > > > > > > > > > > > > > > On Fri, Sep 5, 2014 at 9:38 AM, Jianshi Huang < > > > > jianshi.hu...@gmail.com > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Thanks Ted! > > > > > > > > > > > > > > > > Didn't know I still need to run the 'balancer' command. > > > > > > > > > > > > > > > > Is there a way to do it programmatically? > > > > > > > > > > > > > > > > Jianshi > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Sat, Sep 6, 2014 at 12:29 AM, Ted Yu > > > > > wrote: > > > > > > > > > > >
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Ah... I don't have access to HMaster logs... I need to ask the admin. Jianshi On Mon, Sep 22, 2014 at 11:49 PM, Ted Yu wrote: > bq. assign per-table balancer class > > No that I know of. > Can you pastebin master log involving output from balancer ? > > Cheers > > On Mon, Sep 22, 2014 at 8:29 AM, Jianshi Huang > wrote: > > > Hi Ted, > > > > I moved setBalancerRunning before balancer and run them twice. However I > > still got highly skewed region distribution. > > > > I guess it's because of the StochasticLoadBalancer, can I assign > per-table > > balancer class in HBase? > > > > > > Jianshi > > > > On Mon, Sep 22, 2014 at 9:50 PM, Ted Yu wrote: > > > > > admin.setBalancerRunning() call should precede the call to > > > admin.balancer(). > > > > > > You can inspect master log to see whether regions are being moved off > the > > > heavily loaded server. > > > > > > Cheers > > > > > > On Mon, Sep 22, 2014 at 1:42 AM, Jianshi Huang < > jianshi.hu...@gmail.com> > > > wrote: > > > > > > > Hi Ted and others, > > > > > > > > I did the following after adding splits (without data) to my table, > > > however > > > > the region is still very imbalanced (one region server has 221 > regions > > > and > > > > other 50 region servers have about 4~8 regions each). > > > > > > > > admin.balancer() > > > > admin.setBalancerRunning(true, true) > > > > > > > > The balancer class in my HBase cluster is > > > > > > > > org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer > > > > > > > > So, is this behavior expected? Can I assign different balancer class > to > > > my > > > > tables (I don't have HBase admin permission)? Which one should I use? > > > > > > > > I just want HBase to evenly distribute the regions even they don't > have > > > > data (that's the purpose of pre-split I think). > > > > > > > > > > > > Jianshi > > > > > > > > > > > > On Sat, Sep 6, 2014 at 12:45 AM, Ted Yu wrote: > > > > > > > > > Yes. 
See the following method in HBaseAdmin: > > > > > > > > > > public boolean balancer() > > > > > > > > > > > > > > > On Fri, Sep 5, 2014 at 9:38 AM, Jianshi Huang < > > jianshi.hu...@gmail.com > > > > > > > > > wrote: > > > > > > > > > > > Thanks Ted! > > > > > > > > > > > > Didn't know I still need to run the 'balancer' command. > > > > > > > > > > > > Is there a way to do it programmatically? > > > > > > > > > > > > Jianshi > > > > > > > > > > > > > > > > > > > > > > > > On Sat, Sep 6, 2014 at 12:29 AM, Ted Yu > > wrote: > > > > > > > > > > > > > After splitting the region, you may need to run balancer to > > spread > > > > the > > > > > > new > > > > > > > regions out. > > > > > > > > > > > > > > Cheers > > > > > > > > > > > > > > > > > > > > > On Fri, Sep 5, 2014 at 9:25 AM, Jianshi Huang < > > > > jianshi.hu...@gmail.com > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hi Shahab, > > > > > > > > > > > > > > > > I see, that seems to be the right way... > > > > > > > > > > > > > > > > > > > > > > > > On Sat, Sep 6, 2014 at 12:21 AM, Shahab Yunus < > > > > > shahab.yu...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Shahab > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Jianshi Huang > > > > > > > > > > > > > > > > LinkedIn: jianshi > > > > > > > > Twitter: @jshuang > > > > > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Jianshi Huang > > > > > > > > > > > > LinkedIn: jianshi > > > > > > Twitter: @jshuang > > > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Jianshi Huang > > > > > > > > LinkedIn: jianshi > > > > Twitter: @jshuang > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: 
http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Hi Ted, I moved setBalancerRunning before balancer and run them twice. However I still got highly skewed region distribution. I guess it's because of the StochasticLoadBalancer, can I assign per-table balancer class in HBase? Jianshi On Mon, Sep 22, 2014 at 9:50 PM, Ted Yu wrote: > admin.setBalancerRunning() call should precede the call to > admin.balancer(). > > You can inspect master log to see whether regions are being moved off the > heavily loaded server. > > Cheers > > On Mon, Sep 22, 2014 at 1:42 AM, Jianshi Huang > wrote: > > > Hi Ted and others, > > > > I did the following after adding splits (without data) to my table, > however > > the region is still very imbalanced (one region server has 221 regions > and > > other 50 region servers have about 4~8 regions each). > > > > admin.balancer() > > admin.setBalancerRunning(true, true) > > > > The balancer class in my HBase cluster is > > > > org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer > > > > So, is this behavior expected? Can I assign different balancer class to > my > > tables (I don't have HBase admin permission)? Which one should I use? > > > > I just want HBase to evenly distribute the regions even they don't have > > data (that's the purpose of pre-split I think). > > > > > > Jianshi > > > > > > On Sat, Sep 6, 2014 at 12:45 AM, Ted Yu wrote: > > > > > Yes. See the following method in HBaseAdmin: > > > > > > public boolean balancer() > > > > > > > > > On Fri, Sep 5, 2014 at 9:38 AM, Jianshi Huang > > > > wrote: > > > > > > > Thanks Ted! > > > > > > > > Didn't know I still need to run the 'balancer' command. > > > > > > > > Is there a way to do it programmatically? > > > > > > > > Jianshi > > > > > > > > > > > > > > > > On Sat, Sep 6, 2014 at 12:29 AM, Ted Yu wrote: > > > > > > > > > After splitting the region, you may need to run balancer to spread > > the > > > > new > > > > > regions out. 
> > > > > > > > > > Cheers > > > > > > > > > > > > > > > On Fri, Sep 5, 2014 at 9:25 AM, Jianshi Huang < > > jianshi.hu...@gmail.com > > > > > > > > > wrote: > > > > > > > > > > > Hi Shahab, > > > > > > > > > > > > I see, that seems to be the right way... > > > > > > > > > > > > > > > > > > On Sat, Sep 6, 2014 at 12:21 AM, Shahab Yunus < > > > shahab.yu...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > Shahab > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Jianshi Huang > > > > > > > > > > > > LinkedIn: jianshi > > > > > > Twitter: @jshuang > > > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Jianshi Huang > > > > > > > > LinkedIn: jianshi > > > > Twitter: @jshuang > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Hi Ted and others, I did the following after adding splits (without data) to my table; however, the regions are still very imbalanced (one region server has 221 regions and the other 50 region servers have about 4~8 regions each).

admin.balancer()
admin.setBalancerRunning(true, true)

The balancer class in my HBase cluster is org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer. So, is this behavior expected? Can I assign a different balancer class to my tables (I don't have HBase admin permission)? Which one should I use? I just want HBase to distribute the regions evenly even if they don't have data (that's the purpose of pre-splitting, I think). Jianshi On Sat, Sep 6, 2014 at 12:45 AM, Ted Yu wrote: > Yes. See the following method in HBaseAdmin: > > public boolean balancer() > > > On Fri, Sep 5, 2014 at 9:38 AM, Jianshi Huang > wrote: > > > Thanks Ted! > > > > Didn't know I still need to run the 'balancer' command. > > > > Is there a way to do it programmatically? > > > > Jianshi > > > > > > > > On Sat, Sep 6, 2014 at 12:29 AM, Ted Yu wrote: > > > > > After splitting the region, you may need to run balancer to spread the > > new > > > regions out. > > > > > > Cheers > > > > > > > > > On Fri, Sep 5, 2014 at 9:25 AM, Jianshi Huang > > > > wrote: > > > > > > > Hi Shahab, > > > > > > > > I see, that seems to be the right way... > > > > > > > > > > > > On Sat, Sep 6, 2014 at 12:21 AM, Shahab Yunus < shahab.yu...@gmail.com> > > > > wrote: > > > > > > > > > Shahab > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Jianshi Huang > > > > > > > > LinkedIn: jianshi > > > > Twitter: @jshuang > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
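As the replies in this thread note, admin.setBalancerRunning(true, true) should precede admin.balancer(), since a balance request is refused while the balancer switch is off. A sketch with a stub interface (the real type is HBaseAdmin; the stub exists only to make the ordering explicit and checkable):

```java
// Stub standing in for HBaseAdmin (sketch only). The point is call order:
// enable the balancer synchronously first, then request one balance pass.
public class BalancerOrder {
    interface BalancerAdmin {
        boolean setBalancerRunning(boolean on, boolean synchronous);
        boolean balancer();
    }

    static boolean rebalance(BalancerAdmin admin) {
        admin.setBalancerRunning(true, true); // enable first, synchronously
        return admin.balancer();              // then ask the master to balance
    }
}
```

With the real client, the return value of balancer() tells you whether the master actually ran (or scheduled) a balance pass.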
Re: Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?
Thanks Esteban for the suggestion. For case 2) KeyPrefixRegionSplitPolicy won't be enough I think as we're constantly adding new types so the #types is unknown at the beginning, and when there's a new type of data, it will add pre-splits [type|00, type|01, ..., type|FF] to the table. Data is ingested one type after another so if there's no auto-splits, ingestion will be too slow. For case 1) I thought about binning, however it makes scans in tableInputFormat more complicated. I think auto pre-splits can solve it so currently a sampling process is run to compute the splitKeys for every ts data to be ingested. Jianshi On Thu, Sep 18, 2014 at 3:19 AM, Esteban Gutierrez wrote: > Thanks Jianshi for that helpful information, > > I think for use case 1) it depends on the data ingestion rate when the > regions need to split. The synchronous split operation makes some sense > there if you want the regions to contain specific time ranges and/or > number of records. > > For use case 2) I think is a good match for the KeyPrefixRegionSplitPolicy > or DelimitedKeyPrefixRegionSplitPolicy. Since the regions will be split > based on the if type length is fixed or if the type is of varying > length but delimited with | > > On a second thought, it might be even possible to solve 1) with those > prefix based split policies if you use a prefix for your key that also > varies monotonically or can be passed by the client when it has reached > some threshold, e.g. after writing X billion data points, use prefix 001 > and next Y billion data rows use prefix 002 or something like that. > > cheers, > esteban. > > > -- > Cloudera, Inc. > > > On Wed, Sep 17, 2014 at 11:53 AM, Jianshi Huang > wrote: > > > Hi Esteban, > > > > Two reasons to split dynamically, > > > > 1) I have a column family that stores timeseries data for mapreduce > tasks, > > and the rowkey is monotonically increasing to make scanning easier. 
> > > > 2) (a better reason), I'm storing multiple types of data in the same > table, > > and I have about 500TB of data in total. That's many billions of rows and > > many thousands of regions. I want to make sure ingesting one type of data > > won't touch every region which will cause a lot of fragments and merge > > operations, the rowkey is designed as ||. > > > > So either way I would want a dynamic split in my design. > > > > Jianshi > > > > > > On Thu, Sep 18, 2014 at 2:39 AM, Esteban Gutierrez > > > wrote: > > > > > Jianshi, > > > > > > The retry is not an expected behavior that the client should be doing. > In > > > fact you don't want your clients to issue admin operations to the > cluster > > > ;) > > > > > > Shahab's option is the best alternative by polling when the number of > > > regions has changed in the table you want to modify the splits > > dynamically. > > > The JIRA that Ted suggested requires modification in the core table > > > operations to support sync operations and requires some major work to > do > > it > > > right. Ted's alternative to create the splits at table creation time is > > the > > > best option if you can pre-split IMHO. > > > > > > If you could elaborate more on the practical reasons you mention to > > create > > > synchronously those new regions that would be great for us. Maybe its > > > related to multi-tenancy but I'm just guessing :) > > > > > > esteban. > > > > > > > > > -- > > > Cloudera, Inc. > > > > > > > > > On Wed, Sep 17, 2014 at 11:09 AM, Ted Yu wrote: > > > > > > > Jianshi: > > > > See HBASE-11608 Add synchronous split > > > > > > > > bq. createTable does something special? > > > > > > > > Yes. 
See this in HBaseAdmin: > > > > > > > > public void createTable(final HTableDescriptor desc, byte [][] > > > splitKeys) > > > > > > > > On Wed, Sep 17, 2014 at 10:58 AM, Jianshi Huang < > > jianshi.hu...@gmail.com > > > > > > > > wrote: > > > > > > > > > I see Shahab, async makes sense, but I prefer that the HBase client > > > does > > > > > the retry for me, and let me specify a timeout parameter. > > > > > > > > > > One question, does that mean adding multiple splits into one region > > has > > > > to > > > > > be done sequentially? How can I add region splits
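[Editor's note] The sampling process mentioned above — computing splitKeys from a sorted sample of the row keys before ingestion — can be sketched as pure key logic. This is an illustrative helper (not an HBase API); it simply picks numRegions - 1 evenly spaced quantile boundaries from the sample so each pre-split region receives a similar share of rows:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitKeySampler {
    /**
     * Pick (numRegions - 1) evenly spaced split keys from a sorted sample
     * of row keys, so each pre-split region gets a similar share of data.
     */
    public static List<String> computeSplitKeys(List<String> sortedSample, int numRegions) {
        List<String> splitKeys = new ArrayList<>();
        int n = sortedSample.size();
        for (int i = 1; i < numRegions; i++) {
            // index of the i-th quantile boundary within the sample
            splitKeys.add(sortedSample.get((int) ((long) i * n / numRegions)));
        }
        return splitKeys;
    }
}
```

The resulting keys would then be passed to createTable(desc, splitKeys) (after converting to byte[]) to pre-split the table.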
Re: Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?
Hi Esteban, Two reasons to split dynamically, 1) I have a column family that stores timeseries data for mapreduce tasks, and the rowkey is monotonically increasing to make scanning easier. 2) (a better reason), I'm storing multiple types of data in the same table, and I have about 500TB of data in total. That's many billions of rows and many thousands of regions. I want to make sure ingesting one type of data won't touch every region which will cause a lot of fragments and merge operations, the rowkey is designed as ||. So either way I would want a dynamic split in my design. Jianshi On Thu, Sep 18, 2014 at 2:39 AM, Esteban Gutierrez wrote: > Jianshi, > > The retry is not an expected behavior that the client should be doing. In > fact you don't want your clients to issue admin operations to the cluster > ;) > > Shahab's option is the best alternative by polling when the number of > regions has changed in the table you want to modify the splits dynamically. > The JIRA that Ted suggested requires modification in the core table > operations to support sync operations and requires some major work to do it > right. Ted's alternative to create the splits at table creation time is the > best option if you can pre-split IMHO. > > If you could elaborate more on the practical reasons you mention to create > synchronously those new regions that would be great for us. Maybe its > related to multi-tenancy but I'm just guessing :) > > esteban. > > > -- > Cloudera, Inc. > > > On Wed, Sep 17, 2014 at 11:09 AM, Ted Yu wrote: > > > Jianshi: > > See HBASE-11608 Add synchronous split > > > > bq. createTable does something special? > > > > Yes. See this in HBaseAdmin: > > > > public void createTable(final HTableDescriptor desc, byte [][] > splitKeys) > > > > On Wed, Sep 17, 2014 at 10:58 AM, Jianshi Huang > > > wrote: > > > > > I see Shahab, async makes sense, but I prefer that the HBase client > does > > > the retry for me, and let me specify a timeout parameter. 
> > > > > > One question, does that mean adding multiple splits into one region has > > to > > > be done sequentially? How can I add region splits in parallel? Does > > > createTable does something special? > > > > > > > > > Jianshi > > > > > > > > > On Wed, Sep 17, 2014 at 8:06 PM, Shahab Yunus > > > wrote: > > > > > > > Split is an async operation. When you call it, and the call returns, > it > > > > does not mean that the region has been created yet. > > > > > > > > So either you wait for a while (using Thread.sleep) or check for the > > > number > > > > of regions in a loop and until they have increased to the value you > > want > > > > and then access the region. The former is not a good idea, though you > > can > > > > try it out just to make sure that this is indeed the issue. > > > > > > > > What am I suggesting is something like (pseudo code): > > > > > > > > while(new#regions > old#regions) > > > > { > > > >new#regions = admin.getLatest#regions > > > > } > > > > > > > > Regards, > > > > Shahab > > > > > > > > On Wed, Sep 17, 2014 at 5:39 AM, Jianshi Huang < > > jianshi.hu...@gmail.com> > > > > wrote: > > > > > > > > > I constantly get the following errors when I tried to add splits > to a > > > > > table. > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): > > > > > org.apache.hadoop.hbase.NotServingRegionException: Region > > > > > > > > > > > > > > > grapple_vertices,cust|rval#7eb7cffca280|1636500018299676757,1410945568 > > > > > 484.e7743495366df3c82a8571b36c2bdac3. is not online on > > > > > lvshdc5dn0193.lvs.paypal.com,60020,1405014719359 > > > > > at > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676) > > > > > at > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095) > > >
Re: Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?
You rock Ted, I would also add a synchronous addSplits as well; there's no good reason multiple splits have to be done sequentially. I also checked createTable, and I traced the code to here and lost track...

    executeCallable(new MasterCallable<Void>(getConnection()) {
      @Override
      public Void call() throws ServiceException {
        CreateTableRequest request =
            RequestConverter.buildCreateTableRequest(desc, splitKeys);
        master.createTable(null, request);
        return null;
      }
    });

So what happens in the handler of the CreateTableRequest? Which part of the code should I check? Jianshi On Thu, Sep 18, 2014 at 2:09 AM, Ted Yu wrote: > Jianshi: > See HBASE-11608 Add synchronous split > > bq. createTable does something special? > > Yes. See this in HBaseAdmin: > > public void createTable(final HTableDescriptor desc, byte [][] splitKeys) > > On Wed, Sep 17, 2014 at 10:58 AM, Jianshi Huang > wrote: > > > I see Shahab, async makes sense, but I prefer that the HBase client does > > the retry for me, and let me specify a timeout parameter. > > > > One question, does that mean adding multiple splits into one region has > to > > be done sequentially? How can I add region splits in parallel? Does > > createTable does something special? > > > > > > Jianshi > > > > > > On Wed, Sep 17, 2014 at 8:06 PM, Shahab Yunus > > wrote: > > > > > Split is an async operation. When you call it, and the call returns, it > > > does not mean that the region has been created yet. > > > > > > So either you wait for a while (using Thread.sleep) or check for the > > number > > > of regions in a loop and until they have increased to the value you > want > > > and then access the region. The former is not a good idea, though you > can > > > try it out just to make sure that this is indeed the issue. 
> > > > > > What am I suggesting is something like (pseudo code): > > > > > > while(new#regions > old#regions) > > > { > > >new#regions = admin.getLatest#regions > > > } > > > > > > Regards, > > > Shahab > > > > > > On Wed, Sep 17, 2014 at 5:39 AM, Jianshi Huang < > jianshi.hu...@gmail.com> > > > wrote: > > > > > > > I constantly get the following errors when I tried to add splits to a > > > > table. > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): > > > > org.apache.hadoop.hbase.NotServingRegionException: Region > > > > > > > > > > grapple_vertices,cust|rval#7eb7cffca280|1636500018299676757,1410945568 > > > > 484.e7743495366df3c82a8571b36c2bdac3. is not online on > > > > lvshdc5dn0193.lvs.paypal.com,60020,1405014719359 > > > > at > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676) > > > > at > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095) > > > > at > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.splitRegion(HRegionServer.java:3818) > > > > at > > > > > > > > > > > > But when I checked the region server (from hbase' webUI), the region > is > > > > actually listed there. > > > > > > > > What does the error mean actually? How can I solve it? > > > > > > > > Currently I'm adding splits single-threaded, and I want to make it > > > > parallel, is there anything I need to be careful about? 
> > > > > > > > Here's the code for adding splits: > > > > > > > > def addSplits(tableName: String, splitKeys: Seq[Array[Byte]]): Unit > > = { > > > > val admin = new HBaseAdmin(conn) > > > > > > > > try { > > > > val regions = admin.getTableRegions(tableName.getBytes("UTF8")) > > > > val regionStartKeys = regions.map(_.getStartKey) > > > > val splits = splitKeys.diff(regionStartKeys) > > > > > > > > splits.foreach { splitPoint => > > > > admin.split(tableName.getBytes("UTF8"), splitPoint) > > > > } > > > > // NOTE: important! > > > > admin.balancer() > > > > } > > > > finally { > > > > admin.close() > > > > } > > > > } > > > > > > > > > > > > Any help is appreciated. > > > > > > > > -- > > > > Jianshi Huang > > > > > > > > LinkedIn: jianshi > > > > Twitter: @jshuang > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?
Yes Esteban, there're very practical reasons to do the pre-split dynamically. Jianshi On Thu, Sep 18, 2014 at 1:41 AM, Esteban Gutierrez wrote: > Hi Jianshi, > > Is there any reason why you need to split dynamically the table? Users > usually pre-split their tables with a specific number of splits or they > pick a region split policy that fits their needs: > > > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/DelimitedKeyPrefixRegionSplitPolicy.html > > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/ConstantSizeRegionSplitPolicy.html > > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/IncreasingToUpperBoundRegionSplitPolicy.html > > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/KeyPrefixRegionSplitPolicy.html > > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/DisabledRegionSplitPolicy.html > > or they have the options to implement their own. See for some details > http://hbase.apache.org/book/regions.arch.html#arch.region.split > > cheers, > esteban. > > > -- > Cloudera, Inc. > > > On Wed, Sep 17, 2014 at 5:06 AM, Shahab Yunus > wrote: > > > Split is an async operation. When you call it, and the call returns, it > > does not mean that the region has been created yet. > > > > So either you wait for a while (using Thread.sleep) or check for the > number > > of regions in a loop and until they have increased to the value you want > > and then access the region. The former is not a good idea, though you can > > try it out just to make sure that this is indeed the issue. > > > > What am I suggesting is something like (pseudo code): > > > > while(new#regions > old#regions) > > { > >new#regions = admin.getLatest#regions > > } > > > > Regards, > > Shahab > > > > On Wed, Sep 17, 2014 at 5:39 AM, Jianshi Huang > > wrote: > > > > > I constantly get the following errors when I tried to add splits to a > > > table. 
> > > > > > > > > > > > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): > > > org.apache.hadoop.hbase.NotServingRegionException: Region > > > > > > grapple_vertices,cust|rval#7eb7cffca280|1636500018299676757,1410945568 > > > 484.e7743495366df3c82a8571b36c2bdac3. is not online on > > > lvshdc5dn0193.lvs.paypal.com,60020,1405014719359 > > > at > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676) > > > at > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095) > > > at > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.splitRegion(HRegionServer.java:3818) > > > at > > > > > > > > > But when I checked the region server (from hbase' webUI), the region is > > > actually listed there. > > > > > > What does the error mean actually? How can I solve it? > > > > > > Currently I'm adding splits single-threaded, and I want to make it > > > parallel, is there anything I need to be careful about? > > > > > > Here's the code for adding splits: > > > > > > def addSplits(tableName: String, splitKeys: Seq[Array[Byte]]): Unit > = { > > > val admin = new HBaseAdmin(conn) > > > > > > try { > > > val regions = admin.getTableRegions(tableName.getBytes("UTF8")) > > > val regionStartKeys = regions.map(_.getStartKey) > > > val splits = splitKeys.diff(regionStartKeys) > > > > > > splits.foreach { splitPoint => > > > admin.split(tableName.getBytes("UTF8"), splitPoint) > > > } > > > // NOTE: important! > > > admin.balancer() > > > } > > > finally { > > > admin.close() > > > } > > > } > > > > > > > > > Any help is appreciated. > > > > > > -- > > > Jianshi Huang > > > > > > LinkedIn: jianshi > > > Twitter: @jshuang > > > Github & Blog: http://huangjs.github.com/ > > > > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?
I see Shahab, async makes sense, but I prefer that the HBase client does the retry for me, and let me specify a timeout parameter. One question, does that mean adding multiple splits into one region has to be done sequentially? How can I add region splits in parallel? Does createTable does something special? Jianshi On Wed, Sep 17, 2014 at 8:06 PM, Shahab Yunus wrote: > Split is an async operation. When you call it, and the call returns, it > does not mean that the region has been created yet. > > So either you wait for a while (using Thread.sleep) or check for the number > of regions in a loop and until they have increased to the value you want > and then access the region. The former is not a good idea, though you can > try it out just to make sure that this is indeed the issue. > > What am I suggesting is something like (pseudo code): > > while(new#regions > old#regions) > { >new#regions = admin.getLatest#regions > } > > Regards, > Shahab > > On Wed, Sep 17, 2014 at 5:39 AM, Jianshi Huang > wrote: > > > I constantly get the following errors when I tried to add splits to a > > table. > > > > > > > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): > > org.apache.hadoop.hbase.NotServingRegionException: Region > > > grapple_vertices,cust|rval#7eb7cffca280|1636500018299676757,1410945568 > > 484.e7743495366df3c82a8571b36c2bdac3. is not online on > > lvshdc5dn0193.lvs.paypal.com,60020,1405014719359 > > at > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676) > > at > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095) > > at > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.splitRegion(HRegionServer.java:3818) > > at > > > > > > But when I checked the region server (from hbase' webUI), the region is > > actually listed there. > > > > What does the error mean actually? How can I solve it? 
> > > > Currently I'm adding splits single-threaded, and I want to make it > > parallel, is there anything I need to be careful about? > > > > Here's the code for adding splits: > > > > def addSplits(tableName: String, splitKeys: Seq[Array[Byte]]): Unit = { > > val admin = new HBaseAdmin(conn) > > > > try { > > val regions = admin.getTableRegions(tableName.getBytes("UTF8")) > > val regionStartKeys = regions.map(_.getStartKey) > > val splits = splitKeys.diff(regionStartKeys) > > > > splits.foreach { splitPoint => > > admin.split(tableName.getBytes("UTF8"), splitPoint) > > } > > // NOTE: important! > > admin.balancer() > > } > > finally { > > admin.close() > > } > > } > > > > > > Any help is appreciated. > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?
I constantly get the following errors when I try to add splits to a table.

org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region grapple_vertices,cust|rval#7eb7cffca280|1636500018299676757,1410945568484.e7743495366df3c82a8571b36c2bdac3. is not online on lvshdc5dn0193.lvs.paypal.com,60020,1405014719359
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.splitRegion(HRegionServer.java:3818)

But when I checked the region server (from HBase's web UI), the region is actually listed there.

What does the error actually mean? How can I solve it?

Currently I'm adding splits single-threaded, and I want to make it parallel; is there anything I need to be careful about?

Here's the code for adding splits:

    import scala.collection.JavaConversions._ // getTableRegions returns a java.util.List

    def addSplits(tableName: String, splitKeys: Seq[Array[Byte]]): Unit = {
      val admin = new HBaseAdmin(conn)
      try {
        val regions = admin.getTableRegions(tableName.getBytes("UTF8"))
        val regionStartKeys = regions.map(_.getStartKey)
        // caveat: Seq.diff compares Array[Byte] by reference, not content;
        // wrap the keys (e.g. in a comparable type) to diff by value
        val splits = splitKeys.diff(regionStartKeys)

        splits.foreach { splitPoint =>
          admin.split(tableName.getBytes("UTF8"), splitPoint)
        }
        // NOTE: important!
        admin.balancer()
      }
      finally {
        admin.close()
      }
    }

Any help is appreciated. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
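[Editor's note] Since admin.split() is asynchronous (as Shahab explains above), a client-side workaround is to poll the region count after issuing each split and only proceed once it has increased. Below is a hedged, self-contained sketch of that polling loop; the region-count supplier is abstracted so that, in real use, it would stand in for something like admin.getTableRegions(table).size():

```java
import java.util.function.IntSupplier;

public class SplitWaiter {
    /**
     * Poll until regionCount reports more regions than the baseline,
     * or the timeout elapses. Returns true if the count increased.
     */
    public static boolean waitForRegionIncrease(IntSupplier regionCount, int baseline,
                                                long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (regionCount.getAsInt() > baseline) {
                return true;  // the split has taken effect
            }
            Thread.sleep(pollMs);  // back off before polling again
        }
        return false;  // timed out; caller decides whether to retry or fail
    }
}
```

This is essentially Shahab's pseudo-code made concrete, with a timeout added so a failed split cannot hang the client forever.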
Re: Deploy filter on per table basis
Thanks Ted! Jianshi On Tue, Sep 9, 2014 at 10:39 PM, Ted Yu wrote: > Please take a look at HBASE-1936 > > Cheers > > On Mon, Sep 8, 2014 at 11:26 PM, Jianshi Huang > wrote: > > > Hi, > > > > According to the HBAse definitive guide, I need to change to change > > hbase-env.sh and put my jars in hbase's classpath, then I also need to > > restart hbase daemon to make my customized filters effective. > > > > In the Coprocessor loading section, it also mentioned that coprocessor > can > > be setup and loaded on per table basis. > > > > So is it also possible for filter? The main problem is that I don't have > > HBase admin permissions to do the change. > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Deploy filter on per table basis
Hi, According to the HBase definitive guide, I need to change hbase-env.sh and put my jars in HBase's classpath, and then I also need to restart the HBase daemon to make my customized filters effective. In the coprocessor loading section, it also mentions that a coprocessor can be set up and loaded on a per-table basis. So is the same possible for a filter? The main problem is that I don't have HBase admin permissions to make the change. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
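[Editor's note] For reference, the cluster-wide approach described in the book looks roughly like the fragment below (the jar path is hypothetical); it requires admin access and a region-server restart, which is exactly the limitation raised here. HBASE-1936, referenced in the reply above, added dynamic jar loading from a configurable HDFS directory (hbase.dynamic.jars.dir) in later HBase versions, which may avoid the restart — check your version's documentation before relying on it.

```shell
# In conf/hbase-env.sh on every region server (hypothetical jar path);
# takes effect only after a rolling restart of the region servers.
export HBASE_CLASSPATH="$HBASE_CLASSPATH:/opt/hbase/lib/my-custom-filters.jar"
```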
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Locality is important; that's why I chose a CF to put related data into one group. I can surely put the CF part at the head of the rowkey to achieve a similar result, but since the number of types is fixed, I don't see any benefit in doing that. With the setLoadColumnFamiliesOnDemand option I learned from Ted, it looks like the performance should be similar. Am I missing something? Please enlighten me. Jianshi On Mon, Sep 8, 2014 at 3:41 AM, Michael Segel wrote: > I would suggest rethinking column families and look at your potential for > a slightly different row key. > > Going with column families doesn’t really make sense. > > Also how wide are the rows? (worst case?) > > one idea is to make type part of the RK… > > HTH > > -Mike > > On Sep 7, 2014, at 2:40 AM, Jianshi Huang wrote: > > > Hi Michael, > > > > Thanks for the questions. > > > > I'm modeling dynamic Graphs in HBase, all elements (vertices, edges) > have a > > timestamp and I can query things like events between A and B for the > last 7 > > days. > > > > CFs are used for grouping different types of data for the same account. > > However, I have lots of skews in the data, to avoid having too much for > the > > same row, I had to put what was in CQs to now RKs. So CF now acts more > like > > a table. > > > > There's one CF containing sequence of events ordered by timestamp, and > this > > CF is quite different as the use case is mostly in mapreduce jobs. > > > > Jianshi > > > > > > > > > > On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel > > > wrote: > > > >> Again, a silly question. > >> > >> Why are you using column families? > >> > >> Just to play devil’s advocate in terms of design, why are you not > treating > >> your row as a record? Think hierarchal not relational. > >> > >> This really gets in to some design theory. > >> > >> Think Column Family as a way to group data that has the same row key, > >> reference the same thing, yet the data in each column family is used > >> separately. 
> >> The example I always turn to when teaching, is to think of an order > entry > >> system at a retailer. > >> > >> You generate data which is segmented by business process. (order entry, > >> pick slips, shipping, invoicing) All reflect a single order, yet the > data > >> in each process tends to be accessed separately. > >> (You don’t need the order entry when using the pick slip to pull orders > >> from the warehouse.) So here, the data access pattern is that each > column > >> family is used separately, except in generating the data (the order > entry > >> is used to generate the pick slip(s) and set up things like backorders > and > >> then the pick process generates the shipping slip(s) etc … And since > they > >> are all focused on the same order, they have the same row key. > >> > >> So its reasonable to ask how you are accessing the data and how you are > >> designing your HBase model? > >> > >> Many times, developers create a model using column families because the > >> developer is thinking in terms of relationships. Not access patterns on > the > >> data. > >> > >> Does this make sense? > >> > >> > >> On Sep 6, 2014, at 7:46 PM, Jianshi Huang > wrote: > >> > >>> BTW, a little explanation about the binning I mentioned. > >>> > >>> Currently the rowkey looks like ##. > >>> > >>> And with binning, it looks like > >>> ###. The bin_number > could > >> be > >>> id % 256 or timestamp % 256. And the table could be pre-splitted. So > >> future > >>> ingestions could do parallel insertion to # regions, even without > >>> pre-split. > >>> > >>> > >>> Jianshi > >>> > >>> > >>> On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang > > >>> wrote: > >>> > >>>> Each range might span multiple regions, depending on the data size I > >> want > >>>> scan for MR jobs. > >>>> > >>>> The ranges are dynamic, specified by the user, but the number of bins > >> can > >>>> be static (when the table/schema is created). 
> >>>> > >>>> Jianshi > >>>> > >>>> > >>>> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu wrote: > >>&g
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Hi Michael, Thanks for the questions. I'm modeling dynamic Graphs in HBase, all elements (vertices, edges) have a timestamp and I can query things like events between A and B for the last 7 days. CFs are used for grouping different types of data for the same account. However, I have lots of skews in the data, to avoid having too much for the same row, I had to put what was in CQs to now RKs. So CF now acts more like a table. There's one CF containing sequence of events ordered by timestamp, and this CF is quite different as the use case is mostly in mapreduce jobs. Jianshi On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel wrote: > Again, a silly question. > > Why are you using column families? > > Just to play devil’s advocate in terms of design, why are you not treating > your row as a record? Think hierarchal not relational. > > This really gets in to some design theory. > > Think Column Family as a way to group data that has the same row key, > reference the same thing, yet the data in each column family is used > separately. > The example I always turn to when teaching, is to think of an order entry > system at a retailer. > > You generate data which is segmented by business process. (order entry, > pick slips, shipping, invoicing) All reflect a single order, yet the data > in each process tends to be accessed separately. > (You don’t need the order entry when using the pick slip to pull orders > from the warehouse.) So here, the data access pattern is that each column > family is used separately, except in generating the data (the order entry > is used to generate the pick slip(s) and set up things like backorders and > then the pick process generates the shipping slip(s) etc … And since they > are all focused on the same order, they have the same row key. > > So its reasonable to ask how you are accessing the data and how you are > designing your HBase model? 
> > Many times, developers create a model using column families because the > developer is thinking in terms of relationships. Not access patterns on the > data. > > Does this make sense? > > > On Sep 6, 2014, at 7:46 PM, Jianshi Huang wrote: > > > BTW, a little explanation about the binning I mentioned. > > > > Currently the rowkey looks like ##. > > > > And with binning, it looks like > > ###. The bin_number could > be > > id % 256 or timestamp % 256. And the table could be pre-splitted. So > future > > ingestions could do parallel insertion to # regions, even without > > pre-split. > > > > > > Jianshi > > > > > > On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang > > wrote: > > > >> Each range might span multiple regions, depending on the data size I > want > >> scan for MR jobs. > >> > >> The ranges are dynamic, specified by the user, but the number of bins > can > >> be static (when the table/schema is created). > >> > >> Jianshi > >> > >> > >> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu wrote: > >> > >>> bq. 16 to 256 ranges > >>> > >>> Would each range be within single region or the range may span regions > ? > >>> Are the ranges dynamic ? > >>> > >>> Using command line for multiple ranges would be out of question. A file > >>> with ranges is needed. > >>> > >>> Cheers > >>> > >>> > >>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang < > jianshi.hu...@gmail.com> > >>> wrote: > >>> > >>>> Thanks Ted for the reference. > >>>> > >>>> That's right, extend the row.start and row.end to specify multiple > >>> ranges > >>>> and also getSplits. > >>>> > >>>> I would probably bin the event sequence CF into 16 to 256 bins. So 16 > to > >>>> 256 ranges. > >>>> > >>>> Jianshi > >>>> > >>>> > >>>> > >>>> On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu wrote: > >>>> > >>>>> Please refer to HBASE-5416 Filter on one CF and if a match, then load > >>> and > >>>>> return full row > >>>>> > >>>>> bq. 
to extend TableInputFormat to accept multiple row ranges > >>>>> > >>>>> You mean extending hbase.mapreduce.scan.row.start and > >>>>> hbase.mapreduce.scan.row.stop so that multiple ranges can be > >>> specified ? > >>>>> How many such ranges do you normally need ? >
Re: One-table w/ multi-CF or multi-table w/ one-CF?
BTW, a little explanation about the binning I mentioned. Currently the rowkey looks like ##. And with binning, it looks like ###. The bin_number could be id % 256 or timestamp % 256. And the table could be pre-splitted. So future ingestions could do parallel insertion to # regions, even without pre-split. Jianshi On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang wrote: > Each range might span multiple regions, depending on the data size I want > scan for MR jobs. > > The ranges are dynamic, specified by the user, but the number of bins can > be static (when the table/schema is created). > > Jianshi > > > On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu wrote: > >> bq. 16 to 256 ranges >> >> Would each range be within single region or the range may span regions ? >> Are the ranges dynamic ? >> >> Using command line for multiple ranges would be out of question. A file >> with ranges is needed. >> >> Cheers >> >> >> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang >> wrote: >> >> > Thanks Ted for the reference. >> > >> > That's right, extend the row.start and row.end to specify multiple >> ranges >> > and also getSplits. >> > >> > I would probably bin the event sequence CF into 16 to 256 bins. So 16 to >> > 256 ranges. >> > >> > Jianshi >> > >> > >> > >> > On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu wrote: >> > >> > > Please refer to HBASE-5416 Filter on one CF and if a match, then load >> and >> > > return full row >> > > >> > > bq. to extend TableInputFormat to accept multiple row ranges >> > > >> > > You mean extending hbase.mapreduce.scan.row.start and >> > > hbase.mapreduce.scan.row.stop so that multiple ranges can be >> specified ? >> > > How many such ranges do you normally need ? >> > > >> > > Cheers >> > > >> > > >> > > On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang < >> jianshi.hu...@gmail.com> >> > > wrote: >> > > >> > > > Thanks Ted, >> > > > >> > > > I'll pre-split the table during ingestion. 
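[Editor's note] The binning scheme described above — a small modulo-derived prefix in front of an otherwise monotonic key — can be sketched as follows. The placeholder field names were lost in the archive, so the bin#id#type#timestamp layout below is an assumption for illustration only:

```java
public class BinnedRowKey {
    /**
     * Prefix the key with a bin in [0, bins), e.g. id % 256, so concurrent
     * ingestion spreads across pre-split regions instead of always hitting
     * the last region of a monotonically increasing keyspace.
     */
    public static String binnedKey(long id, String type, long timestamp, int bins) {
        long bin = Math.floorMod(id, (long) bins); // deterministic: same id, same bin
        return String.format("%02x#%d#%s#%d", bin, id, type, timestamp);
    }
}
```

With 256 bins, a table pre-split at 00#, 01#, ..., ff# lets an ingestion job write to all regions in parallel, at the cost of issuing one scan range per bin when reading back a contiguous id/time range.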
The reason to keep the >> > rowkey >> > > > monotonic is for easier working with TableInputFormat, otherwise I >> > > would've >> > > > binned it into 256 splits. (well, I think a good way is to extend >> > > > TableInputFormat to accept multiple row ranges, if there's an >> existing >> > > > efficient implementation, please let me know :) >> > > > >> > > > Would you elaborate a little more on the heap memory usage during >> scan? >> > > Is >> > > > there any reference to that? >> > > > >> > > > Jianshi >> > > > >> > > > >> > > > >> > > > On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu wrote: >> > > > >> > > > > If you use monotonically increasing rowkeys, separating out the >> > column >> > > > > family into a new table would give you same issue you're facing >> > today. >> > > > > >> > > > > Using a single table, essential column family feature would reduce >> > the >> > > > > amount of heap memory used during scan. With two tables, there is >> no >> > > such >> > > > > facility. >> > > > > >> > > > > Cheers >> > > > > >> > > > > >> > > > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang < >> > > jianshi.hu...@gmail.com> >> > > > > wrote: >> > > > > >> > > > > > Hi Ted, >> > > > > > >> > > > > > Yes, that's the table having RegionTooBusyExceptions :) But the >> > > > > performance >> > > > > > I care most are scan performance. >> > > > > > >> > > > > > It's mostly for analytics, so I don't care much about atomicity >> > > > > currently. >> > > > > > >> > > > > > What's your suggestion? >> > > > > > >> > > > > > Jianshi >> > > > > > >> > > > > > >> > > > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu >> > wrote: >> > > > > &
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Each range might span multiple regions, depending on the data size I want scan for MR jobs. The ranges are dynamic, specified by the user, but the number of bins can be static (when the table/schema is created). Jianshi On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu wrote: > bq. 16 to 256 ranges > > Would each range be within single region or the range may span regions ? > Are the ranges dynamic ? > > Using command line for multiple ranges would be out of question. A file > with ranges is needed. > > Cheers > > > On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang > wrote: > > > Thanks Ted for the reference. > > > > That's right, extend the row.start and row.end to specify multiple ranges > > and also getSplits. > > > > I would probably bin the event sequence CF into 16 to 256 bins. So 16 to > > 256 ranges. > > > > Jianshi > > > > > > > > On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu wrote: > > > > > Please refer to HBASE-5416 Filter on one CF and if a match, then load > and > > > return full row > > > > > > bq. to extend TableInputFormat to accept multiple row ranges > > > > > > You mean extending hbase.mapreduce.scan.row.start and > > > hbase.mapreduce.scan.row.stop so that multiple ranges can be specified > ? > > > How many such ranges do you normally need ? > > > > > > Cheers > > > > > > > > > On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang < > jianshi.hu...@gmail.com> > > > wrote: > > > > > > > Thanks Ted, > > > > > > > > I'll pre-split the table during ingestion. The reason to keep the > > rowkey > > > > monotonic is for easier working with TableInputFormat, otherwise I > > > would've > > > > binned it into 256 splits. (well, I think a good way is to extend > > > > TableInputFormat to accept multiple row ranges, if there's an > existing > > > > efficient implementation, please let me know :) > > > > > > > > Would you elaborate a little more on the heap memory usage during > scan? > > > Is > > > > there any reference to that? 
> > > > > > > > Jianshi > > > > > > > > > > > > > > > > On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu wrote: > > > > > > > > > If you use monotonically increasing rowkeys, separating out the > > column > > > > > family into a new table would give you same issue you're facing > > today. > > > > > > > > > > Using a single table, essential column family feature would reduce > > the > > > > > amount of heap memory used during scan. With two tables, there is > no > > > such > > > > > facility. > > > > > > > > > > Cheers > > > > > > > > > > > > > > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang < > > > jianshi.hu...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hi Ted, > > > > > > > > > > > > Yes, that's the table having RegionTooBusyExceptions :) But the > > > > > performance > > > > > > I care most are scan performance. > > > > > > > > > > > > It's mostly for analytics, so I don't care much about atomicity > > > > > currently. > > > > > > > > > > > > What's your suggestion? > > > > > > > > > > > > Jianshi > > > > > > > > > > > > > > > > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu > > wrote: > > > > > > > > > > > > > Is this the same table you mentioned in the thread about > > > > > > > RegionTooBusyException > > > > > > > ? > > > > > > > > > > > > > > If you move the column family to another table, you may have to > > > > handle > > > > > > > atomicity yourself - currently atomic operations are within > > region > > > > > > > boundaries. > > > > > > > > > > > > > > Cheers > > > > > > > > > > > > > > > > > > > > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang < > > > > jianshi.hu...@gmail.com > > > > > > > > > > > >
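The 16-to-256-bin idea discussed above can be sketched as plain key arithmetic: prefix each rowkey with a one-byte bin id, and derive one [startRow, stopRow) pair per bin; a multi-range TableInputFormat would then turn each pair into its own set of splits. A minimal sketch — all names here are illustrative, not part of any HBase API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the range-binning idea: a 1-byte bin prefix spreads a monotonic
// key space over numBins bins, and each bin gets its own scan range.
// Illustrative helper, not an HBase API.
class BinRanges {
    // One {startRow, stopRow} pair per bin, ordered by bin id.
    static List<byte[][]> ranges(int numBins) {
        List<byte[][]> r = new ArrayList<>();
        for (int bin = 0; bin < numBins; bin++) {
            byte[] start = {(byte) bin};
            byte[] stop = (bin == numBins - 1)
                    ? new byte[0]                  // empty stop row = scan to end of table
                    : new byte[]{(byte) (bin + 1)};
            r.add(new byte[][]{start, stop});
        }
        return r;
    }
}
```

A file of such ranges (as Ted suggests) is just this list serialized; the extended input format would call getSplits once per range.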
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Thanks Ted for the reference. That's right, extend the row.start and row.end to specify multiple ranges and also getSplits. I would probably bin the event sequence CF into 16 to 256 bins. So 16 to 256 ranges. Jianshi On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu wrote: > Please refer to HBASE-5416 Filter on one CF and if a match, then load and > return full row > > bq. to extend TableInputFormat to accept multiple row ranges > > You mean extending hbase.mapreduce.scan.row.start and > hbase.mapreduce.scan.row.stop so that multiple ranges can be specified ? > How many such ranges do you normally need ? > > Cheers > > > On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang > wrote: > > > Thanks Ted, > > > > I'll pre-split the table during ingestion. The reason to keep the rowkey > > monotonic is for easier working with TableInputFormat, otherwise I > would've > > binned it into 256 splits. (well, I think a good way is to extend > > TableInputFormat to accept multiple row ranges, if there's an existing > > efficient implementation, please let me know :) > > > > Would you elaborate a little more on the heap memory usage during scan? > Is > > there any reference to that? > > > > Jianshi > > > > > > > > On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu wrote: > > > > > If you use monotonically increasing rowkeys, separating out the column > > > family into a new table would give you same issue you're facing today. > > > > > > Using a single table, essential column family feature would reduce the > > > amount of heap memory used during scan. With two tables, there is no > such > > > facility. > > > > > > Cheers > > > > > > > > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang < > jianshi.hu...@gmail.com> > > > wrote: > > > > > > > Hi Ted, > > > > > > > > Yes, that's the table having RegionTooBusyExceptions :) But the > > > performance > > > > I care most are scan performance. > > > > > > > > It's mostly for analytics, so I don't care much about atomicity > > > currently. 
> > > > > > > > What's your suggestion? > > > > > > > > Jianshi > > > > > > > > > > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu wrote: > > > > > > > > > Is this the same table you mentioned in the thread about > > > > > RegionTooBusyException > > > > > ? > > > > > > > > > > If you move the column family to another table, you may have to > > handle > > > > > atomicity yourself - currently atomic operations are within region > > > > > boundaries. > > > > > > > > > > Cheers > > > > > > > > > > > > > > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang < > > jianshi.hu...@gmail.com > > > > > > > > > wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > I'm currently putting everything into one table (to make cross > > > > reference > > > > > > queries easier) and there's one CF which contains rowkeys very > > > > different > > > > > to > > > > > > the rest. Currently it works well, but I'm wondering if it will > > cause > > > > > > performance issues in the future. > > > > > > > > > > > > So my questions are > > > > > > > > > > > > 1) will there be performance penalties in the way I'm doing? > > > > > > 2) should I move that CF to a separate table? > > > > > > > > > > > > > > > > > > Thanks, > > > > > > -- > > > > > > Jianshi Huang > > > > > > > > > > > > LinkedIn: jianshi > > > > > > Twitter: @jshuang > > > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Jianshi Huang > > > > > > > > LinkedIn: jianshi > > > > Twitter: @jshuang > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Thanks Ted, I'll pre-split the table during ingestion. The reason to keep the rowkey monotonic is for easier working with TableInputFormat, otherwise I would've binned it into 256 splits. (well, I think a good way is to extend TableInputFormat to accept multiple row ranges, if there's an existing efficient implementation, please let me know :) Would you elaborate a little more on the heap memory usage during scan? Is there any reference to that? Jianshi On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu wrote: > If you use monotonically increasing rowkeys, separating out the column > family into a new table would give you same issue you're facing today. > > Using a single table, essential column family feature would reduce the > amount of heap memory used during scan. With two tables, there is no such > facility. > > Cheers > > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang > wrote: > > > Hi Ted, > > > > Yes, that's the table having RegionTooBusyExceptions :) But the > performance > > I care most are scan performance. > > > > It's mostly for analytics, so I don't care much about atomicity > currently. > > > > What's your suggestion? > > > > Jianshi > > > > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu wrote: > > > > > Is this the same table you mentioned in the thread about > > > RegionTooBusyException > > > ? > > > > > > If you move the column family to another table, you may have to handle > > > atomicity yourself - currently atomic operations are within region > > > boundaries. > > > > > > Cheers > > > > > > > > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang > > > > wrote: > > > > > > > Hi, > > > > > > > > I'm currently putting everything into one table (to make cross > > reference > > > > queries easier) and there's one CF which contains rowkeys very > > different > > > to > > > > the rest. Currently it works well, but I'm wondering if it will cause > > > > performance issues in the future. 
> > > > > > > > So my questions are > > > > > > > > 1) will there be performance penalties in the way I'm doing? > > > > 2) should I move that CF to a separate table? > > > > > > > > > > > > Thanks, > > > > -- > > > > Jianshi Huang > > > > > > > > LinkedIn: jianshi > > > > Twitter: @jshuang > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
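Pre-splitting during ingestion, as mentioned above, needs a set of split keys at table-creation time. A minimal sketch that generates evenly spaced one-byte boundaries; the helper is illustrative, and the commented createTable usage assumes the 0.98-era client API:

```java
// Sketch: evenly spaced split keys over a 1-byte key prefix, for
// pre-splitting a table into numRegions regions at creation time.
class SplitKeys {
    // Returns numRegions - 1 boundary keys, e.g. {16}, {32}, ... {240}
    // for numRegions = 16.
    static byte[][] make(int numRegions) {
        byte[][] keys = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            keys[i - 1] = new byte[]{(byte) (i * 256 / numRegions)};
        }
        return keys;
    }
    // Usage against a real cluster (not compiled here; 0.98-era API):
    //   HBaseAdmin admin = new HBaseAdmin(conf);
    //   admin.createTable(tableDescriptor, SplitKeys.make(16));
}
```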
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Well, write performance is also important... I'll probably ingest 1k~10k records/second. Jianshi On Sun, Sep 7, 2014 at 1:11 AM, Jianshi Huang wrote: > Hi Ted, > > Yes, that's the table having RegionTooBusyExceptions :) But the > performance I care most are scan performance. > > It's mostly for analytics, so I don't care much about atomicity currently. > > What's your suggestion? > > Jianshi > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu wrote: > >> Is this the same table you mentioned in the thread about >> RegionTooBusyException >> ? >> >> If you move the column family to another table, you may have to handle >> atomicity yourself - currently atomic operations are within region >> boundaries. >> >> Cheers >> >> >> On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang >> wrote: >> >> > Hi, >> > >> > I'm currently putting everything into one table (to make cross reference >> > queries easier) and there's one CF which contains rowkeys very >> different to >> > the rest. Currently it works well, but I'm wondering if it will cause >> > performance issues in the future. >> > >> > So my questions are >> > >> > 1) will there be performance penalties in the way I'm doing? >> > 2) should I move that CF to a separate table? >> > >> > >> > Thanks, >> > -- >> > Jianshi Huang >> > >> > LinkedIn: jianshi >> > Twitter: @jshuang >> > Github & Blog: http://huangjs.github.com/ >> > >> > > > > -- > Jianshi Huang > > LinkedIn: jianshi > Twitter: @jshuang > Github & Blog: http://huangjs.github.com/ > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Hi Ted, Yes, that's the table having RegionTooBusyExceptions :) But the performance I care most are scan performance. It's mostly for analytics, so I don't care much about atomicity currently. What's your suggestion? Jianshi On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu wrote: > Is this the same table you mentioned in the thread about > RegionTooBusyException > ? > > If you move the column family to another table, you may have to handle > atomicity yourself - currently atomic operations are within region > boundaries. > > Cheers > > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang > wrote: > > > Hi, > > > > I'm currently putting everything into one table (to make cross reference > > queries easier) and there's one CF which contains rowkeys very different > to > > the rest. Currently it works well, but I'm wondering if it will cause > > performance issues in the future. > > > > So my questions are > > > > 1) will there be performance penalties in the way I'm doing? > > 2) should I move that CF to a separate table? > > > > > > Thanks, > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
One-table w/ multi-CF or multi-table w/ one-CF?
Hi, I'm currently putting everything into one table (to make cross reference queries easier) and there's one CF which contains rowkeys very different to the rest. Currently it works well, but I'm wondering if it will cause performance issues in the future. So my questions are 1) will there be performance penalties in the way I'm doing? 2) should I move that CF to a separate table? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Thanks Ted! Didn't know I still need to run the 'balancer' command. Is there a way to do it programmatically? Jianshi On Sat, Sep 6, 2014 at 12:29 AM, Ted Yu wrote: > After splitting the region, you may need to run balancer to spread the new > regions out. > > Cheers > > > On Fri, Sep 5, 2014 at 9:25 AM, Jianshi Huang > wrote: > > > Hi Shahab, > > > > I see, that seems to be the right way... > > > > > > On Sat, Sep 6, 2014 at 12:21 AM, Shahab Yunus > > wrote: > > > > > Shahab > > > > > > > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
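On the question of running the balancer programmatically: the 0.98-era client exposes it on HBaseAdmin. A hedged sketch — it requires a running cluster, so it is not runnable standalone:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Sketch (0.98-era client API): trigger a balancer pass from code
// instead of the 'balancer' shell command.
public class RunBalancer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            // Asks the master to run one balance pass; returns false if
            // balancing is switched off or could not run.
            boolean ran = admin.balancer();
            System.out.println("balancer invoked: " + ran);
        } finally {
            admin.close();
        }
    }
}
```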
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Hi Steven, I did 1) and 2) and the error was during LoadIncrementalHFiles. I can't do 3) because that CF is mostly used for mapreduce inputs, so a continuous rowkey is preferred. Jianshi On Sat, Sep 6, 2014 at 12:29 AM, Magana-zook, Steven Alan < maganazo...@llnl.gov> wrote: > Jianshi, > > I have seen many solutions to importing this kind of data: > > 1. Pre-splitting regions (I did not try this) > > 2. Using a map reduce job to create HFiles instead of putting individual > rows into the database > (instructions here: http://hbase.apache.org/book/arch.bulk.load.html > > 3. Modifying the row key to not be monotonic > > I went with the third solution by prepending a random integer before the > other fields in my composite row key ("<random int>_<field 1>_<field 2>...") > > When you make any changes, you can verify it is working by viewing the > Hbase web interface (port 60010 on the hbase master) to see the requests > per second on the various region servers. > > > Thank you, > Steven Magana-Zook > > > > > > > On 9/5/14 9:14 AM, "Jianshi Huang" wrote: > > >Thanks Ted, I'll try to do a major compact. > > > >Hi Steven, > > > >Yes, most of my rows are hashed to make it randomly distributed, but one > >column family has monotonically increasing rowkeys, and it's used for > >recording sequence of events. > > > >Do you have a solution how to bulk import this kind of data? > > > >Jianshi > > > > > > > >On Sat, Sep 6, 2014 at 12:00 AM, Magana-zook, Steven Alan < > >maganazo...@llnl.gov> wrote: > > > >> Hi Jianshi, > >> > >> What are the field(s) in your row key? If your row key is monotonically > >> increasing then you will be sending all of your requests to one region > >> server. Even after the region splits, all new entries will keep > >>punishing > >> one server (the region responsible for the split containing the new > >>keys). > >> > >> See these articles that may help if this is indeed your issue: > >> 1. http://hbase.apache.org/book/rowkey.design.html > >> 2. 
> >> > >> > http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-inc > >>re > >> asing-values-are-bad/ > >> > >> Regards, > >> Steven Magana-Zook > >> > >> > >> > >> > >> > >> > >> On 9/5/14 8:54 AM, "Jianshi Huang" wrote: > >> > >> >Hi JM, > >> > > >> >What do you mean by the 'destination cluster'? The files are in the > >>same > >> >Hadoop/HDFS cluster where HBase is running. > >> > > >> >Do you mean do the bulk importing on HBase Master node? > >> > > >> > > >> >Jianshi > >> > > >> > > >> >On Fri, Sep 5, 2014 at 11:18 PM, Jean-Marc Spaggiari < > >> >jean-m...@spaggiari.org> wrote: > >> > > >> >> Hi Jianshi, > >> >> > >> >> You might want to upload the file on the destination cluster first > >>and > >> >>then > >> >> re-run your bulk load from there. That way the transfer time will > >>not be > >> >> taken into consideration for the timeout size the files will be > >>local. > >> >> > >> >> JM > >> >> > >> >> > >> >> 2014-09-05 11:15 GMT-04:00 Jianshi Huang : > >> >> > >> >> > I'm importing 2TB of generated HFiles to HBase and I constantly get > >> >>the > >> >> > following errors: > >> >> > > >> >> > Caused by: > >> >> > > >> >> > > >> >> > >> > >>>>org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop > >>>>.h > >> >>base.RegionTooBusyException): > >> >> > org.apache.hadoop.hbase.RegionTooBusyException: failed to get a > >>lock > >> >>in > >> >> > 6 ms. > >> >> > > >> >> > > >> >> > >> > >>>>regionName=grapple_edges_v2,ff00,1409817320781.6d2955c780b39523de73 > >>>>3f > >> >>3565642d96., > >> >> > server=x.xxx.xxx,60020,1404854700728 > >> >> > at > >> >> > > >>org.apache.hadoop.hbase.regionserver.HRegion.l
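The salting scheme Steven describes can be sketched in a few lines. This variant derives the prefix from a hash of the logical key rather than a purely random integer, so readers can recompute the bucket for point gets (a truly random prefix spreads writes the same way but forces readers to probe every bucket). All names are illustrative:

```java
import java.nio.charset.StandardCharsets;

// Sketch of rowkey salting: physical rowkey = 1-byte hash bucket + logical
// key, so monotonically increasing keys spread across all regions.
class SaltedKey {
    static final int BUCKETS = 16;  // illustrative bucket count

    static byte[] toRowKey(String logicalKey) {
        byte[] k = logicalKey.getBytes(StandardCharsets.UTF_8);
        int h = 0;
        for (byte b : k) h = 31 * h + (b & 0xff);
        byte salt = (byte) Math.floorMod(h, BUCKETS);  // stable, recomputable
        byte[] out = new byte[k.length + 1];
        out[0] = salt;
        System.arraycopy(k, 0, out, 1, k.length);
        return out;
    }

    // Strip the salt byte to recover the logical key.
    static String toLogicalKey(byte[] rowKey) {
        return new String(rowKey, 1, rowKey.length - 1, StandardCharsets.UTF_8);
    }
}
```

Note the trade-off the thread touches on: salting breaks the continuous rowkey order that makes a single-range TableInputFormat scan convenient, which is why Jianshi declined option 3.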
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Hi Shahab, I see, that seems to be the right way... On Sat, Sep 6, 2014 at 12:21 AM, Shahab Yunus wrote: > Shahab -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Thanks Ted, I'll try to do a major compact. Hi Steven, Yes, most of my rows are hashed to make it randomly distributed, but one column family has monotonically increasing rowkeys, and it's used for recording sequence of events. Do you have a solution how to bulk import this kind of data? Jianshi On Sat, Sep 6, 2014 at 12:00 AM, Magana-zook, Steven Alan < maganazo...@llnl.gov> wrote: > Hi Jianshi, > > What are the field(s) in your row key? If your row key is monotonically > increasing then you will be sending all of your requests to one region > server. Even after the region splits, all new entries will keep punishing > one server (the region responsible for the split containing the new keys). > > See these articles that may help if this is indeed your issue: > 1. http://hbase.apache.org/book/rowkey.design.html > 2. > http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-incre > asing-values-are-bad/ > > Regards, > Steven Magana-Zook > > > > > > > On 9/5/14 8:54 AM, "Jianshi Huang" wrote: > > >Hi JM, > > > >What do you mean by the 'destination cluster'? The files are in the same > >Hadoop/HDFS cluster where HBase is running. > > > >Do you mean do the bulk importing on HBase Master node? > > > > > >Jianshi > > > > > >On Fri, Sep 5, 2014 at 11:18 PM, Jean-Marc Spaggiari < > >jean-m...@spaggiari.org> wrote: > > > >> Hi Jianshi, > >> > >> You might want to upload the file on the destination cluster first and > >>then > >> re-run your bulk load from there. That way the transfer time will not be > >> taken into consideration for the timeout size the files will be local. 
> >> > >> JM > >> > >> > >> 2014-09-05 11:15 GMT-04:00 Jianshi Huang : > >> > >> > I'm importing 2TB of generated HFiles to HBase and I constantly get > >>the > >> > following errors: > >> > > >> > Caused by: > >> > > >> > > >> > >>org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.h > >>base.RegionTooBusyException): > >> > org.apache.hadoop.hbase.RegionTooBusyException: failed to get a lock > >>in > >> > 6 ms. > >> > > >> > > >> > >>regionName=grapple_edges_v2,ff00,1409817320781.6d2955c780b39523de733f > >>3565642d96., > >> > server=x.xxx.xxx,60020,1404854700728 > >> > at > >> > org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5851) > >> > at > >> > org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5837) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.regionserver.HRegion.startBulkRegionOperation(HRe > >>gion.java:5795) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java: > >>3543) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java: > >>3525) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.regionserver.HRegionServer.bulkLoadHFile(HRegionS > >>erver.java:3277) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.c > >>allBlockingMethod(ClientProtos.java:28863) > >> > at > >> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008) > >> > at > >>org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSche > >>duler.java:160) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSchedu > >>ler.java:38) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.j > >>ava:110) > >> > at java.lang.Thread.run(Thread.java:724) > >> > > >> > at > 
>> org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1498) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1 > >>684) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.ca > >>llBlockingMethod(RpcClient.java:1737) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$Blo > >>ckingStub.bulkLoadHFile(ClientProtos.java:29276) > >> > at > >> > > >> > > >> > >>org.apache.hadoop.hbase.protobuf.ProtobufUtil.bulkLoadHFile(ProtobufUtil. > >>java:1548) > >> > ... 11 more > >> > > >> > > >> > What makes the region too busy? Is there a way to improve it? > >> > > >> > Does that also mean some part of my data are not correctly imported? > >> > > >> > > >> > Thanks, > >> > > >> > -- > >> > Jianshi Huang > >> > > >> > LinkedIn: jianshi > >> > Twitter: @jshuang > >> > Github & Blog: http://huangjs.github.com/ > >> > > >> > > > > > > > >-- > >Jianshi Huang > > > >LinkedIn: jianshi > >Twitter: @jshuang > >Github & Blog: http://huangjs.github.com/ > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms
Hi JM, What do you mean by the 'destination cluster'? The files are in the same Hadoop/HDFS cluster where HBase is running. Do you mean do the bulk importing on HBase Master node? Jianshi On Fri, Sep 5, 2014 at 11:18 PM, Jean-Marc Spaggiari < jean-m...@spaggiari.org> wrote: > Hi Jianshi, > > You might want to upload the file on the destination cluster first and then > re-run your bulk load from there. That way the transfer time will not be > taken into consideration for the timeout size the files will be local. > > JM > > > 2014-09-05 11:15 GMT-04:00 Jianshi Huang : > > > I'm importing 2TB of generated HFiles to HBase and I constantly get the > > following errors: > > > > Caused by: > > > > > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.RegionTooBusyException): > > org.apache.hadoop.hbase.RegionTooBusyException: failed to get a lock in > > 6 ms. > > > > > regionName=grapple_edges_v2,ff00,1409817320781.6d2955c780b39523de733f3565642d96., > > server=x.xxx.xxx,60020,1404854700728 > > at > > org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5851) > > at > > org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5837) > > at > > > > > org.apache.hadoop.hbase.regionserver.HRegion.startBulkRegionOperation(HRegion.java:5795) > > at > > > > > org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3543) > > at > > > > > org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3525) > > at > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.bulkLoadHFile(HRegionServer.java:3277) > > at > > > > > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28863) > > at > org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008) > > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92) > > at > > > > > org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160) > > at > > > > > 
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38) > > at > > > > > org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110) > > at java.lang.Thread.run(Thread.java:724) > > > > at > org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1498) > > at > > > > > org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684) > > at > > > > > org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737) > > at > > > > > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.bulkLoadHFile(ClientProtos.java:29276) > > at > > > > > org.apache.hadoop.hbase.protobuf.ProtobufUtil.bulkLoadHFile(ProtobufUtil.java:1548) > > ... 11 more > > > > > > What makes the region too busy? Is there a way to improve it? > > > > Does that also mean some part of my data are not correctly imported? > > > > > > Thanks, > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Help: RegionTooBusyException: failed to get a lock in 60000 ms
I'm importing 2TB of generated HFiles to HBase and I constantly get the following errors: Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.RegionTooBusyException): org.apache.hadoop.hbase.RegionTooBusyException: failed to get a lock in 60000 ms. regionName=grapple_edges_v2,ff00,1409817320781.6d2955c780b39523de733f3565642d96., server=x.xxx.xxx,60020,1404854700728 at org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5851) at org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5837) at org.apache.hadoop.hbase.regionserver.HRegion.startBulkRegionOperation(HRegion.java:5795) at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3543) at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3525) at org.apache.hadoop.hbase.regionserver.HRegionServer.bulkLoadHFile(HRegionServer.java:3277) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28863) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38) at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110) at java.lang.Thread.run(Thread.java:724) at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1498) at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684) at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.bulkLoadHFile(ClientProtos.java:29276) at org.apache.hadoop.hbase.protobuf.ProtobufUtil.bulkLoadHFile(ProtobufUtil.java:1548) ... 11 more What makes the region too busy? 
Is there a way to improve it? Does that also mean some part of my data are not correctly imported? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
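One knob worth knowing here: the 60000 ms in this exception appears to come from hbase.busy.wait.duration (default 60000 ms), which bounds how long a bulk-load or mutation waits for the region lock before throwing RegionTooBusyException. A hedged hbase-site.xml fragment, assuming that key governs this wait in the version in use (raising it only papers over contention; spreading the load across regions is the real fix):

```xml
<!-- hbase-site.xml, server side. Assumption: the "failed to get a lock
     in 60000 ms" wait is controlled by hbase.busy.wait.duration. -->
<property>
  <name>hbase.busy.wait.duration</name>
  <value>120000</value>
</property>
```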
Re: ResultScanner performance
Ah, sure. That's a good idea. I know how to do it now. :) Thanks for the help. Jianshi On Thu, Aug 28, 2014 at 12:29 PM, Ted Yu wrote: > You can enhance ColumnRangeFilter to return the first column in the range. > > In its filterKeyValue(Cell kv) method: > > int cmpMax = Bytes.compareTo(buffer, qualifierOffset, qualifierLength, > > this.maxColumn, 0, this.maxColumn.length); > > if (this.maxColumnInclusive && cmpMax <= 0 || > > !this.maxColumnInclusive && cmpMax < 0) { > > return ReturnCode.INCLUDE; > > } > > ReturnCode.NEXT_ROW should be returned (for subsequent columns) once > ReturnCode.INCLUDE is returned for the first column in range. > > Cheers > > > On Wed, Aug 27, 2014 at 9:05 PM, Jianshi Huang > wrote: > > > Very similar. We setup a column range (we're using ColumnRangeFilter > right > > now), and we want the first column in the range. > > > > The problem we have a lot of rows. > > > > If there's no such capability, then we need to control the parallelism > > ourselves. > > > > Shall I sort the rows first before scanning? Will a random order be more > > efficient if we have many servers? > > > > Jianshi > > > > > > On Thu, Aug 28, 2014 at 1:44 AM, Ted Yu wrote: > > > > > So you want to specify several columns. e.g. c2, c3, and c4, the GET is > > > supposed to return the first one of them (doesn't have to be c2, can be > > c3 > > > if c2 is absent) ? > > > > > > To my knowledge there is no such capability now. > > > > > > Cheers > > > > > > > > > On Wed, Aug 27, 2014 at 10:28 AM, Jianshi Huang < > jianshi.hu...@gmail.com > > > > > > wrote: > > > > > > > On Thu, Aug 28, 2014 at 1:20 AM, Jianshi Huang < > > jianshi.hu...@gmail.com> > > > > wrote: > > > > > > > > > > > > > > There's a special but common case that for each row we only need > the > > > > first > > > > > column. Is there a better way to do this than multiple scans + > > take(1)? 
> > > > > > > > > > > > > We still need to set a column range, is there a way to get the first > > > column > > > > value of a range using GET? > > > > > > > > > > > > -- > > > > Jianshi Huang > > > > > > > > LinkedIn: jianshi > > > > Twitter: @jshuang > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
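Ted's suggested enhancement — include the first qualifier in the range, then jump to the next row — can be sketched as pure decision logic, with the HBase Filter wiring (filterKeyValue on Cell, reset per row) reduced to plain methods over qualifier strings. The enum mirrors Filter.ReturnCode values; class and method names are illustrative:

```java
// Sketch of a "first column in range" filter's decision logic: behaves
// like a ColumnRangeFilter until one qualifier in [min, max] is included
// for the current row, then skips to the next row.
class FirstInRange {
    enum ReturnCode { INCLUDE, SKIP, NEXT_ROW }

    private final String min, max;       // inclusive qualifier bounds
    private boolean matchedThisRow = false;

    FirstInRange(String min, String max) { this.min = min; this.max = max; }

    // Called once per cell, in qualifier order, as filterKeyValue would be.
    ReturnCode filter(String qualifier) {
        if (matchedThisRow) return ReturnCode.NEXT_ROW;  // already have the first column
        if (qualifier.compareTo(min) < 0) return ReturnCode.SKIP;      // before the range
        if (qualifier.compareTo(max) > 0) return ReturnCode.NEXT_ROW;  // past the range
        matchedThisRow = true;
        return ReturnCode.INCLUDE;                        // first column in range
    }

    // Called at each new row, as Filter.reset() would be.
    void reset() { matchedThisRow = false; }
}
```

A real implementation would subclass ColumnRangeFilter and compare raw qualifier bytes with Bytes.compareTo, as in Ted's snippet.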
Re: ResultScanner performance
Very similar. We set up a column range (we're using ColumnRangeFilter right now), and we want the first column in the range. The problem is that we have a lot of rows. If there's no such capability, then we need to control the parallelism ourselves. Shall I sort the rows first before scanning? Will a random order be more efficient if we have many servers? Jianshi On Thu, Aug 28, 2014 at 1:44 AM, Ted Yu wrote: > So you want to specify several columns. e.g. c2, c3, and c4, the GET is > supposed to return the first one of them (doesn't have to be c2, can be c3 > if c2 is absent) ? > > To my knowledge there is no such capability now. > > Cheers > > > On Wed, Aug 27, 2014 at 10:28 AM, Jianshi Huang > wrote: > > > On Thu, Aug 28, 2014 at 1:20 AM, Jianshi Huang > > wrote: > > > > > > > > There's a special but common case that for each row we only need the > > first > > > column. Is there a better way to do this than multiple scans + take(1)? > > > > > > > We still need to set a column range, is there a way to get the first > column > > value of a range using GET? > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: ResultScanner performance
On Thu, Aug 28, 2014 at 1:20 AM, Jianshi Huang wrote: > > There's a special but common case that for each row we only need the first > column. Is there a better way to do this than multiple scans + take(1)? > We still need to set a column range, is there a way to get the first column value of a range using GET? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: ResultScanner performance
Hi, The reason we cannot close the ResultScanner (or issue a multi-get), is that we have wide rows with many columns, and we want to iterate over them rather than get all the columns at once. There's a special but common case that for each row we only need the first column. Is there a better way to do this than multiple scans + take(1)? Jianshi On Wed, Aug 27, 2014 at 12:44 PM, Dai, Kevin wrote: > Hi, Ted > > I think you are right. But we must hold the ResultScanner for a while. So > is there any way to reduce the performance loss? Or is there any way to > share the connection? > > Best regards, > Kevin. > > -Original Message- > From: Ted Yu [mailto:yuzhih...@gmail.com] > Sent: 2014年8月27日 11:36 > To: user@hbase.apache.org > Subject: Re: ResultScanner performance > > Keeping many ResultScanners open at the same time is not good for > performance. > > Please see: > http://hbase.apache.org/book.html#perf.hbase.client.scannerclose > > After fetching results from ResultScanner, you should close it ASAP. > > Cheers > > > On Tue, Aug 26, 2014 at 8:18 PM, Dai, Kevin wrote: > > > Hi, Ted > > > > We have a cluster of 48 machines and at least 100T data(which is still > > increasing). > > The problem is that we have a lot of row keys (about tens of thousands > > ) to query in the meantime and we don't fetch all the data at once, > > instead we fetch them when needed, so we may hold tens of thousands > > ResultScanner in the meantime. > > I want to know whether it will hurt the performance and network > > resources and if so, is there any way to solve it? > > > > Best regards, > > Kevin. > > -Original Message- > > From: Ted Yu [mailto:yuzhih...@gmail.com] > > Sent: 2014年8月26日 16:49 > > To: user@hbase.apache.org > > Cc: user@hbase.apache.org; Huang, Jianshi > > Subject: Re: ResultScanner performance > > > > Can you give a bit more detail ? > > What size is the cluster / dataset ? > > What problem are you solving ? 
> > Would using coprocessor help reduce the usage of ResultScanner ? > > > > Cheers > > > > On Aug 26, 2014, at 12:13 AM, "Dai, Kevin" wrote: > > > > > Hi, everyone > > > > > > My application will hold tens of thousands of ResultScanner to get > Data. > > Will it hurt the performance and network resources? > > > If so, is there any way to solve it? > > > Thanks, > > > Kevin. > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: Hbase InputFormat for multi-row + column range, how to do it?
I see and I'll try. Thanks Andrey! Jianshi On Wed, Aug 20, 2014 at 6:01 PM, Andrey Stepachev wrote: > Hi Jianshi. > > You can create your own. Just inherit from TableInputFormatBase or > TableInputFormat and add ColumnRangeFilter to scan (either construct your > own, or intercept setScan method). > > Hope this helps. > > -- > Andrey. > > > On Wed, Aug 20, 2014 at 1:35 PM, Jianshi Huang > wrote: > > > Hi, > > > > I know TableInputFormat and HFileInputFormat can both set ROW_START and > > ROW_END, but none of them can set the column range (like what we do in > > ColumnRangeFilter). > > > > So how can I do column range in HBase InputFormat? Is there an > > implementation available? If not, how much effort do you think it takes > to > > implement one? > > > > Best, > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > > > > -- > Andrey. > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
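Andrey's suggestion could look roughly like this (a sketch only; the class name and the hard-coded qualifier bounds are invented for illustration):

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;

// Hypothetical InputFormat that intercepts setScan() to force a column range
// on top of whatever row range the job has already configured.
public class ColumnRangeTableInputFormat extends TableInputFormat {
  @Override
  public void setScan(Scan scan) {
    scan.setFilter(new ColumnRangeFilter(
        "colA".getBytes(), true,    // min qualifier, inclusive
        "colZ".getBytes(), false)); // max qualifier, exclusive
    super.setScan(scan);
  }
}
```

If the job might already install its own filter, you would wrap both in a FilterList instead of overwriting it here.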
Hbase InputFormat for multi-row + column range, how to do it?
Hi, I know TableInputFormat and HFileInputFormat can both set ROW_START and ROW_END, but neither of them can set a column range (like what we do with ColumnRangeFilter). So how can I do a column-range scan with an HBase InputFormat? Is there an implementation available? If not, how much effort do you think it would take to implement one? Best, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: How are split files distributed across Region servers?
Ok, I found some references. I was actually asking about HBase's default load balancer, and from googling, it seems it only evens out the number of regions across region servers; which regions go to which server is random. I also found a good load balancer implementation: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.html Thanks for the help JM! :) Jianshi On Tue, Aug 19, 2014 at 2:31 PM, lars hofhansl wrote: > I'd change the max file size to 20GB. That'd give you 5000 regions for > 100TB. > > > > ________ > From: Jianshi Huang > To: user@hbase.apache.org > Sent: Monday, August 18, 2014 12:22 PM > Subject: Re: How are split files distributed across Region servers? > > > Hi JM, > > Make the range bigger you mean to make it multiple regions/splits, right? > > I probably will have >100TB of data, and I think the default split file > size is 10GB. So I can assume each of my 100 machines will get assigned to > 100 *random* regions? > > Where can I find the implementation details or settings for region > assignment? > > Jianshi > > > > On Mon, Aug 18, 2014 at 8:48 PM, Jean-Marc Spaggiari < > jean-m...@spaggiari.org> wrote: > > > Hi Jianshi, > > > > A region server can host more than one region. So if you pre-split your > > table correctly based on your access usage, at the end all the servers > > should be used evenly. > > > > If you have about 30% or your range which is not used, just make sure > that > > this range is bigger so at the end it will have the same load at the > > others. > > > > JM > > > > > > 2014-08-18 2:08 GMT-04:00 Jianshi Huang : > > > > > Hi JM, > > > > > > If the region boundaries will not change, does that mean, > > > > > > If my data access pattern has skews (say a certain part (30%) of my > data > > > will almost never be used), then a proportion (30%) of my server will > > > always be idle? > > > > > > A region server has to have a continuous rowkey range? 
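For reference, selecting the StochasticLoadBalancer is a plain hbase-site.xml setting on the master (property name as given in the balancer's javadoc; in later HBase versions this balancer is already the default):

```xml
<property>
  <name>hbase.master.loadbalancer.class</name>
  <value>org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer</value>
</property>
```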
> > > > > > Jianshi > > > > > > > > > > > > > > > On Sat, Aug 16, 2014 at 2:46 AM, Jean-Marc Spaggiari < > > > jean-m...@spaggiari.org> wrote: > > > > > > > H Jianshi, > > > > > > > > Not sure to get your question. > > > > > > > > Can I rephrase it? > > > > > > > > So you have 10 regions, and each of those regions has 10 HFiles. Then > > you > > > > run a major compaction on the table. Correct? > > > > > > > > Then you will end up with: > > > > > > > > reg1:[files:1] > > > > reg2:[files:2] > > > > reg3:[files:3] > > > > ... > > > > > > > > Regions boundaries will not change. But each region will not have a > > > single > > > > underlaying file. > > > > > > > > HTH, > > > > > > > > JM > > > > > > > > > > > > 2014-08-15 1:53 GMT-04:00 Jianshi Huang : > > > > > > > > > Say I have 100 split files on 10 region servers, and I did a major > > > > compact. > > > > > > > > > > Will these split files be distributed like this: > > > > > reg1: [splits 1,2,..,10] > > > > > reg2: [splits 11,12,...,20] > > > > > ... > > > > > > > > > > Or like this: > > > > > reg1: [splits: 1, 11, 21, ... , 91] > > > > > reg2: [splits: 2, 12, 22, ... , 92] > > > > > ... > > > > > > > > > > And if I want to specify the locality and the stride of split > files? > > > How > > > > > can I do it in HBase? > > > > > > > > > > > > > > > -- > > > > > Jianshi Huang > > > > > > > > > > LinkedIn: jianshi > > > > > Twitter: @jshuang > > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Jianshi Huang > > > > > > LinkedIn: jianshi > > > Twitter: @jshuang > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > -- > Jianshi Huang > > LinkedIn: jianshi > Twitter: @jshuang > Github & Blog: http://huangjs.github.com/ > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: How are split files distributed across Region servers?
Hi JM, Make the range bigger you mean to make it multiple regions/splits, right? I probably will have >100TB of data, and I think the default split file size is 10GB. So I can assume each of my 100 machines will get assigned to 100 *random* regions? Where can I find the implementation details or settings for region assignment? Jianshi On Mon, Aug 18, 2014 at 8:48 PM, Jean-Marc Spaggiari < jean-m...@spaggiari.org> wrote: > Hi Jianshi, > > A region server can host more than one region. So if you pre-split your > table correctly based on your access usage, at the end all the servers > should be used evenly. > > If you have about 30% or your range which is not used, just make sure that > this range is bigger so at the end it will have the same load at the > others. > > JM > > > 2014-08-18 2:08 GMT-04:00 Jianshi Huang : > > > Hi JM, > > > > If the region boundaries will not change, does that mean, > > > > If my data access pattern has skews (say a certain part (30%) of my data > > will almost never be used), then a proportion (30%) of my server will > > always be idle? > > > > A region server has to have a continuous rowkey range? > > > > Jianshi > > > > > > > > > > On Sat, Aug 16, 2014 at 2:46 AM, Jean-Marc Spaggiari < > > jean-m...@spaggiari.org> wrote: > > > > > H Jianshi, > > > > > > Not sure to get your question. > > > > > > Can I rephrase it? > > > > > > So you have 10 regions, and each of those regions has 10 HFiles. Then > you > > > run a major compaction on the table. Correct? > > > > > > Then you will end up with: > > > > > > reg1:[files:1] > > > reg2:[files:2] > > > reg3:[files:3] > > > ... > > > > > > Regions boundaries will not change. But each region will not have a > > single > > > underlaying file. > > > > > > HTH, > > > > > > JM > > > > > > > > > 2014-08-15 1:53 GMT-04:00 Jianshi Huang : > > > > > > > Say I have 100 split files on 10 region servers, and I did a major > > > compact. 
> > > > > > > > Will these split files be distributed like this: > > > > reg1: [splits 1,2,..,10] > > > > reg2: [splits 11,12,...,20] > > > > ... > > > > > > > > Or like this: > > > > reg1: [splits: 1, 11, 21, ... , 91] > > > > reg2: [splits: 2, 12, 22, ... , 92] > > > > ... > > > > > > > > And if I want to specify the locality and the stride of split files? > > How > > > > can I do it in HBase? > > > > > > > > > > > > -- > > > > Jianshi Huang > > > > > > > > LinkedIn: jianshi > > > > Twitter: @jshuang > > > > Github & Blog: http://huangjs.github.com/ > > > > > > > > > > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: How are split files distributed across Region servers?
Hi JM, If the region boundaries will not change, does that mean that if my data access pattern is skewed (say a certain part (30%) of my data will almost never be used), then a proportion (30%) of my servers will always be idle? Does a region server have to host a contiguous rowkey range? Jianshi On Sat, Aug 16, 2014 at 2:46 AM, Jean-Marc Spaggiari < jean-m...@spaggiari.org> wrote: > H Jianshi, > > Not sure to get your question. > > Can I rephrase it? > > So you have 10 regions, and each of those regions has 10 HFiles. Then you > run a major compaction on the table. Correct? > > Then you will end up with: > > reg1:[files:1] > reg2:[files:2] > reg3:[files:3] > ... > > Regions boundaries will not change. But each region will not have a single > underlaying file. > > HTH, > > JM > > > 2014-08-15 1:53 GMT-04:00 Jianshi Huang : > > > Say I have 100 split files on 10 region servers, and I did a major > compact. > > > > Will these split files be distributed like this: > > reg1: [splits 1,2,..,10] > > reg2: [splits 11,12,...,20] > > ... > > > > Or like this: > > reg1: [splits: 1, 11, 21, ... , 91] > > reg2: [splits: 2, 12, 22, ... , 92] > > ... > > > > And if I want to specify the locality and the stride of split files? How > > can I do it in HBase? > > > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
How are split files distributed across Region servers?
Say I have 100 split files on 10 region servers, and I run a major compaction. Will these split files be distributed like this: reg1: [splits 1,2,...,10] reg2: [splits 11,12,...,20] ... Or like this: reg1: [splits: 1, 11, 21, ..., 91] reg2: [splits: 2, 12, 22, ..., 92] ... And what if I want to control the locality and the stride of the split files? How can I do that in HBase? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: How to create a connection pool with specified pool size?
I see. Thank you Ted for the help. :) Jianshi On Mon, Aug 11, 2014 at 9:57 PM, Ted Yu wrote: > If you use the following method: > > public static HConnection > < > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnection.html > > > createConnection(org.apache.hadoop.conf.Configuration conf, >ExecutorService > < > http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html?is-external=true > > > pool) > > You can pass your own ExecutorService. > > See example in > http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html?is-external=true > > Cheers > > > > On Mon, Aug 11, 2014 at 2:40 AM, Jianshi Huang > wrote: > > > I followed the manual and uses HConnectionManager.createConnection to > > create a connection pool. > > > > However I couldn't find reference about how to specify the pool size? It > > should be in the second parameter pool of type ExecutorService, right? > How > > can I do that? > > > > Cheers, > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
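Concretely, Ted's suggestion might look like this (a sketch only; the pool size of 16 and the helper name are arbitrary examples):

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;

public class PooledConnection {
  public static HConnection create(int poolSize) throws IOException {
    // The ExecutorService you pass in bounds the number of concurrent worker
    // threads the connection uses; a fixed pool caps it at poolSize threads.
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    Configuration conf = HBaseConfiguration.create();
    return HConnectionManager.createConnection(conf, pool);
  }
}
```

Usage: `HConnection conn = PooledConnection.create(16);` — and remember to close the connection (and shut down the pool) when done.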
How to create a connection pool with specified pool size?
I followed the manual and used HConnectionManager.createConnection to create a connection pool. However, I couldn't find any reference on how to specify the pool size. It should be the second parameter, pool, of type ExecutorService, right? How can I do that? Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: Best practice for writing to HFileOutputFormat(2) with multiple Column Families
I know HBase will set the TotalOrderPartitioner in MR, but in Spark, I need to sort the rows myself. Jianshi On Sat, Aug 2, 2014 at 12:24 AM, Arun Allamsetty wrote: > Hi Jianshi, > > Do you mean that you want to sort the row keys? If yes, then you don't have > to worry about it because HBase sorts the row keys on its own but > lexicographically. > > Cheers, > Arun > > Sent from a mobile device. Please don't mind the typos. > On Jul 30, 2014 9:02 PM, "Jianshi Huang" wrote: > > > I need to generate from a 2TB dataset and exploded it to 4 Column > Families. > > > > The result dataset is likely to be 20TB or more. I'm currently using > Spark > > so I sorted the (rk, cf, cq) myself. It's huge and I'm considering how to > > optimize it. > > > > My question is: > > Should I sort and write each column family one by one, or should I put > them > > all together then do sort and write? > > > > Does my question make sense? > > > > -- > > Jianshi Huang > > > > LinkedIn: jianshi > > Twitter: @jshuang > > Github & Blog: http://huangjs.github.com/ > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
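For what it's worth, the order HFiles expect is lexicographic on unsigned bytes: row key first, then column family, then qualifier. A self-contained sketch of that comparator (plain JDK; the `CellKey` triple type is invented here, standing in for HBase's KeyValue):

```java
import java.util.Comparator;

public class CellKeyOrder {
  // Unsigned lexicographic byte[] comparison, like Bytes.compareTo in HBase.
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff); // compare as unsigned bytes
      if (d != 0) return d;
    }
    return a.length - b.length;               // shorter prefix sorts first
  }

  /** Hypothetical (rowkey, family, qualifier) triple. */
  public static class CellKey {
    final byte[] row, family, qualifier;
    public CellKey(byte[] row, byte[] family, byte[] qualifier) {
      this.row = row; this.family = family; this.qualifier = qualifier;
    }
  }

  // Sort by row first, then family, then qualifier.
  public static final Comparator<CellKey> ORDER = (x, y) -> {
    int c = compareBytes(x.row, y.row);
    if (c != 0) return c;
    c = compareBytes(x.family, y.family);
    if (c != 0) return c;
    return compareBytes(x.qualifier, y.qualifier);
  };
}
```

Sorting the whole (rk, cf, cq) stream with one comparator like this, then splitting by family at write time, is one way to avoid a separate sort pass per column family.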
Best practice for writing to HFileOutputFormat(2) with multiple Column Families
I need to generate HFiles from a 2TB dataset, exploded into 4 column families. The resulting dataset is likely to be 20TB or more. I'm currently using Spark, so I sort the (rk, cf, cq) tuples myself. It's huge and I'm considering how to optimize it. My question is: should I sort and write each column family one by one, or should I put them all together and then sort and write? Does my question make sense? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Re: Completebulkload with namespace option?
Wow, thanks! :) On Thu, Jul 31, 2014 at 10:07 AM, Ted Yu wrote: > Matteo acted very fast - this has been fixed by HBASE-11609 > > Cheers > > > On Wed, Jul 30, 2014 at 7:02 PM, Jianshi Huang > wrote: > > > Created a Jira issue. > > > > https://issues.apache.org/jira/browse/HBASE-11622 > > > > > > On Tue, Jul 29, 2014 at 11:46 PM, Bharath Vissapragada < > > bhara...@cloudera.com> wrote: > > > > > Appears to be a bug. It should be TableName.valueOf(...) or something > > > similar. Mind filing a jira? > > > > > > > > > On Tue, Jul 29, 2014 at 12:22 PM, Jianshi Huang < > jianshi.hu...@gmail.com > > > > > > wrote: > > > > > > > I see why, looking at the source code of LoadIncrementalHFiles.java, > it > > > > seems the temporary path created for splitting will contain ':', > > > > > > > > The error part should be this: > > > > String uniqueName = getUniqueName(table.getName()); > > > > HColumnDescriptor familyDesc = > > > > table.getTableDescriptor().getFamily(item.family); > > > > Path botOut = new Path(tmpDir, uniqueName + ".bottom"); > > > > Path topOut = new Path(tmpDir, uniqueName + ".top"); > > > > splitStoreFile(getConf(), hfilePath, familyDesc, splitKey, > > > > botOut, topOut); > > > > > > > > uniqueName will be "namespce:table" so new Path will fail. > > > > > > > > A bug right? > > > > > > > > Jianshi > > > > > > > > > > > > On Tue, Jul 29, 2014 at 2:42 PM, Jianshi Huang < > > jianshi.hu...@gmail.com> > > > > wrote: > > > > > > > > > I'm using hbase 0.98 with HDP 2.1. > > > > > > > > > > > > > > > On Tue, Jul 29, 2014 at 2:39 PM, Jianshi Huang < > > > jianshi.hu...@gmail.com> > > > > > wrote: > > > > > > > > > >> I'm using completebulkload to load 500GB of data to a table > > > > >> (presplitted). However, it reports the following errors: > > > > >> > > > > >> Looks like completebulkload didn't recognize the namespace part > > > > >> (namespace:table). > > > > >> > > > > >> Is there an option to do it? I can't find one in Google... 
> > > > >> > > > > >> Exception in thread "main" 14/07/28 23:32:19 INFO > > > > >> mapreduce.LoadIncrementalHFiles: Trying to load > > > > >> hfile=hdfs://xxx/vertices/PROP/f5cbf0965ff44cb8bdabd038e66485c3 > > > > >> first=dc595cfe#cust#1812199228741466242 > > > > >> last=dc68cedc#cust#2251647837553603393 > > > > >> java.lang.reflect.InvocationTargetException > > > > >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native > > Method) > > > > >> at > > > > >> > > > > > > > > > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > > > > >> at > > > > >> > > > > > > > > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > > > >> at java.lang.reflect.Method.invoke(Method.java:606) > > > > >> at > > > org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:54) > > > > >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native > > Method) > > > > >> at > > > > >> > > > > > > > > > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > > > > >> at > > > > >> > > > > > > > > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > > > >> at java.lang.reflect.Method.invoke(Method.java:606) > > > > >> at org.apache.hadoop.util.RunJar.main(RunJar.java:212) > > > > >> Caused by: java.lang.IllegalStateException: > > > > >> java.lang.IllegalArgumentException: java.net.URISyntaxException: > > > > Relative > > > > >> path in absolute URI: grapple:vertices,37.bottom > > > > >> at > > > > >> > > >
Re: Completebulkload with namespace option?
Created a Jira issue. https://issues.apache.org/jira/browse/HBASE-11622 On Tue, Jul 29, 2014 at 11:46 PM, Bharath Vissapragada < bhara...@cloudera.com> wrote: > Appears to be a bug. It should be TableName.valueOf(...) or something > similar. Mind filing a jira? > > > On Tue, Jul 29, 2014 at 12:22 PM, Jianshi Huang > wrote: > > > I see why, looking at the source code of LoadIncrementalHFiles.java, it > > seems the temporary path created for splitting will contain ':', > > > > The error part should be this: > > String uniqueName = getUniqueName(table.getName()); > > HColumnDescriptor familyDesc = > > table.getTableDescriptor().getFamily(item.family); > > Path botOut = new Path(tmpDir, uniqueName + ".bottom"); > > Path topOut = new Path(tmpDir, uniqueName + ".top"); > > splitStoreFile(getConf(), hfilePath, familyDesc, splitKey, > > botOut, topOut); > > > > uniqueName will be "namespce:table" so new Path will fail. > > > > A bug right? > > > > Jianshi > > > > > > On Tue, Jul 29, 2014 at 2:42 PM, Jianshi Huang > > wrote: > > > > > I'm using hbase 0.98 with HDP 2.1. > > > > > > > > > On Tue, Jul 29, 2014 at 2:39 PM, Jianshi Huang < > jianshi.hu...@gmail.com> > > > wrote: > > > > > >> I'm using completebulkload to load 500GB of data to a table > > >> (presplitted). However, it reports the following errors: > > >> > > >> Looks like completebulkload didn't recognize the namespace part > > >> (namespace:table). > > >> > > >> Is there an option to do it? I can't find one in Google... 
> > >> > > >> Exception in thread "main" 14/07/28 23:32:19 INFO > > >> mapreduce.LoadIncrementalHFiles: Trying to load > > >> hfile=hdfs://xxx/vertices/PROP/f5cbf0965ff44cb8bdabd038e66485c3 > > >> first=dc595cfe#cust#1812199228741466242 > > >> last=dc68cedc#cust#2251647837553603393 > > >> java.lang.reflect.InvocationTargetException > > >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > >> at > > >> > > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > > >> at > > >> > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > >> at java.lang.reflect.Method.invoke(Method.java:606) > > >> at > org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:54) > > >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > >> at > > >> > > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > > >> at > > >> > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > >> at java.lang.reflect.Method.invoke(Method.java:606) > > >> at org.apache.hadoop.util.RunJar.main(RunJar.java:212) > > >> Caused by: java.lang.IllegalStateException: > > >> java.lang.IllegalArgumentException: java.net.URISyntaxException: > > Relative > > >> path in absolute URI: grapple:vertices,37.bottom > > >> at > > >> > > > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplitPhase(LoadIncrementalHFiles.java:421) > > >> at > > >> > > > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:291) > > >> at > > >> > > > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:825) > > >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > > >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) > > >> at > > >> > > > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:831) > > >> at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > >> at > > >> > > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > > >> at > > >> > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > >> at java.lang.reflect.Method.invoke(Method.java:606)
Re: Completebulkload with namespace option?
I see why: looking at the source code of LoadIncrementalHFiles.java, it seems the temporary path created for splitting will contain ':'. The offending part should be this: String uniqueName = getUniqueName(table.getName()); HColumnDescriptor familyDesc = table.getTableDescriptor().getFamily(item.family); Path botOut = new Path(tmpDir, uniqueName + ".bottom"); Path topOut = new Path(tmpDir, uniqueName + ".top"); splitStoreFile(getConf(), hfilePath, familyDesc, splitKey, botOut, topOut); uniqueName will be "namespace:table", so new Path will fail. A bug, right? Jianshi On Tue, Jul 29, 2014 at 2:42 PM, Jianshi Huang wrote: > I'm using hbase 0.98 with HDP 2.1. > > > On Tue, Jul 29, 2014 at 2:39 PM, Jianshi Huang > wrote: > >> I'm using completebulkload to load 500GB of data to a table >> (presplitted). However, it reports the following errors: >> >> Looks like completebulkload didn't recognize the namespace part >> (namespace:table). >> >> Is there an option to do it? I can't find one in Google... 
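The failure can be reproduced with the plain JDK: Hadoop's Path constructor splits the leading "grapple:" off as a URI scheme, and java.net.URI then rejects a non-null scheme combined with a relative path. A minimal sketch (the class name is made up):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class NamespacePathBug {
  public static String reproduce() {
    try {
      // Roughly what Path.initialize() does after "grapple" has been parsed
      // out of "grapple:vertices,37.bottom" as the scheme: a scheme plus a
      // path that does not start with '/'.
      new URI("grapple", null, "vertices,37.bottom", null, null);
      return "no exception";
    } catch (URISyntaxException e) {
      return e.getReason(); // "Relative path in absolute URI"
    }
  }
}
```

This is why HBASE-11609's fix of deriving the temporary file name without the ':' makes the error go away.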
>> >> Exception in thread "main" 14/07/28 23:32:19 INFO >> mapreduce.LoadIncrementalHFiles: Trying to load >> hfile=hdfs://xxx/vertices/PROP/f5cbf0965ff44cb8bdabd038e66485c3 >> first=dc595cfe#cust#1812199228741466242 >> last=dc68cedc#cust#2251647837553603393 >> java.lang.reflect.InvocationTargetException >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> at java.lang.reflect.Method.invoke(Method.java:606) >> at org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:54) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> at java.lang.reflect.Method.invoke(Method.java:606) >> at org.apache.hadoop.util.RunJar.main(RunJar.java:212) >> Caused by: java.lang.IllegalStateException: >> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative >> path in absolute URI: grapple:vertices,37.bottom >> at >> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplitPhase(LoadIncrementalHFiles.java:421) >> at >> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:291) >> at >> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:825) >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) >> at >> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:831) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >> at >> 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> at java.lang.reflect.Method.invoke(Method.java:606) >> at >> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) >> at >> org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145) >> at >> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153) >> ... 10 more >> Caused by: java.lang.IllegalArgumentException: >> java.net.URISyntaxException: Relative path in absolute URI: >> grapple:vertices,37.bottom >> at org.apache.hadoop.fs.Path.initialize(Path.java:206) >> at org.apache.hadoop.fs.Path.(Path.java:172) >> at org.apache.hadoop.fs.Path.(Path.java:94) >> at >> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.splitStoreFile(LoadIncrementalHFiles.java:450) >> at >> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplit(LoadIncrementalHFiles.java:516) >> at >> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:400) >> at >> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:398) >> at java.util.concurrent.FutureTask.run(FutureTask.java:262) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(T
Re: Completebulkload with namespace option?
I'm using hbase 0.98 with HDP 2.1. On Tue, Jul 29, 2014 at 2:39 PM, Jianshi Huang wrote: > I'm using completebulkload to load 500GB of data to a table (presplitted). > However, it reports the following errors: > > Looks like completebulkload didn't recognize the namespace part > (namespace:table). > > Is there an option to do it? I can't find one in Google... > > Exception in thread "main" 14/07/28 23:32:19 INFO > mapreduce.LoadIncrementalHFiles: Trying to load > hfile=hdfs://xxx/vertices/PROP/f5cbf0965ff44cb8bdabd038e66485c3 > first=dc595cfe#cust#1812199228741466242 > last=dc68cedc#cust#2251647837553603393 > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:54) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.util.RunJar.main(RunJar.java:212) > Caused by: java.lang.IllegalStateException: > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: grapple:vertices,37.bottom > at > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplitPhase(LoadIncrementalHFiles.java:421) > at > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:291) > at > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:825) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) > 
at > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:831) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) > at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145) > at > org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153) > ... 10 more > Caused by: java.lang.IllegalArgumentException: > java.net.URISyntaxException: Relative path in absolute URI: > grapple:vertices,37.bottom > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.(Path.java:172) > at org.apache.hadoop.fs.Path.(Path.java:94) > at > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.splitStoreFile(LoadIncrementalHFiles.java:450) > at > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplit(LoadIncrementalHFiles.java:516) > at > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:400) > at > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:398) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > Caused by: java.net.URISyntaxException: Relative path in absolute URI: > grapple:vertices,37.bottom > at java.net.URI.checkPath(URI.java:1804) > at java.net.URI.(URI.java:752) > at org.apache.hadoop.fs.Path.initialize(Path.java:203) > ... 
10 more > > -- > Jianshi Huang > > LinkedIn: jianshi > Twitter: @jshuang > Github & Blog: http://huangjs.github.com/ > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
Completebulkload with namespace option?
I'm using completebulkload to load 500GB of data into a table (pre-split). However, it reports the errors below. It looks like completebulkload didn't recognize the namespace part of the table name (namespace:table). Is there an option for this? I can't find one on Google...

Exception in thread "main" 14/07/28 23:32:19 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://xxx/vertices/PROP/f5cbf0965ff44cb8bdabd038e66485c3 first=dc595cfe#cust#1812199228741466242 last=dc68cedc#cust#2251647837553603393
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.IllegalStateException: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: grapple:vertices,37.bottom
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplitPhase(LoadIncrementalHFiles.java:421)
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:291)
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:825)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:831)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
        at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153)
        ... 10 more
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: grapple:vertices,37.bottom
        at org.apache.hadoop.fs.Path.initialize(Path.java:206)
        at org.apache.hadoop.fs.Path.<init>(Path.java:172)
        at org.apache.hadoop.fs.Path.<init>(Path.java:94)
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.splitStoreFile(LoadIncrementalHFiles.java:450)
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplit(LoadIncrementalHFiles.java:516)
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:400)
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:398)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: grapple:vertices,37.bottom
        at java.net.URI.checkPath(URI.java:1804)
        at java.net.URI.<init>(URI.java:752)
        at org.apache.hadoop.fs.Path.initialize(Path.java:203)
        ... 10 more

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
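[Editor's note] The root cause is visible without any HBase code: Hadoop's Path constructor treats everything before the first ':' as a URI scheme, so an internal name derived from "namespace:table" (here "grapple:vertices,37.bottom") becomes an absolute URI with a relative path, which java.net.URI rejects. A minimal stdlib-only sketch reproducing the same URISyntaxException (the class and method names below are illustrative, not HBase or Hadoop APIs):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class NamespaceUriDemo {
    // Mimics what Hadoop's Path.initialize() does: the part before ':' is
    // passed as the URI scheme and the remainder as the path. Returns the
    // parse failure reason, or null if the name parses cleanly.
    static String pathError(String scheme, String path) {
        try {
            new URI(scheme, null, path, null, null);
            return null;
        } catch (URISyntaxException e) {
            return e.getReason();
        }
    }

    public static void main(String[] args) {
        // "grapple:vertices,37.bottom" splits into scheme "grapple" plus a
        // path that does not start with '/', which URI.checkPath rejects.
        System.out.println(pathError("grapple", "vertices,37.bottom"));
        // Without the namespace colon there is no scheme, so it parses fine.
        System.out.println(pathError(null, "vertices,37.bottom"));
    }
}
```

The sketch only shows why the ':' in the namespace-qualified name trips URI parsing inside LoadIncrementalHFiles; it does not claim a workaround.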
Re: Scan columns of a row within a Range
Yes, I found the info from a nice blog article. Thanks Ted!

Jianshi

On Thu, Jul 17, 2014 at 10:07 PM, Ted Yu wrote:
> ColumnRangeFilter implements getNextCellHint(), which facilitates jumping
> to the minColumn. When the current column is past maxColumn, it skips to
> the next row.
>
> So ColumnRangeFilter is very effective.
>
> Cheers
>
> On Thu, Jul 17, 2014 at 12:45 AM, Jianshi Huang wrote:
> > Hi Esteban,
> >
> > Yes, I found it moments ago. Is it as efficient as a row scan?
> >
> > And can I have millions of columns in a row with no or little
> > performance impact? (The traditional tall-vs-wide problem; the HBase
> > manual recommends tall tables over wide tables.)
> >
> > Jianshi
> >
> > On Thu, Jul 17, 2014 at 3:01 PM, Esteban Gutierrez wrote:
> > > Hi Jianshi,
> > >
> > > Have you looked into the ColumnRangeFilter?
> > >
> > > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html
> > >
> > > cheers,
> > > esteban.
> > >
> > > --
> > > Cloudera, Inc.
> > >
> > > On Wed, Jul 16, 2014 at 11:40 PM, Jianshi Huang wrote:
> > > > Hi,
> > > >
> > > > I scanned through HBase's Scan API and couldn't find a way to scan
> > > > a range of columns within a row.
> > > >
> > > > It seems I can only do scan(startRow, endRow), which takes just
> > > > row keys.
> > > >
> > > > What's the most efficient way to do it? Should I use a Filter? I
> > > > heard filters are not as efficient as row-key scans; how much
> > > > slower are they?
> > > >
> > > > (BTW, I was using Accumulo for the same thing and it has a really
> > > > nice API (Range, Key) for it. A Key is a combination of
> > > > RK+CF+CQ+TS.)
> > > >
> > > > Am I missing anything?
> > > >
> > > > Cheers,
> > > > --
> > > > Jianshi Huang
> > > >
> > > > LinkedIn: jianshi
> > > > Twitter: @jshuang
> > > > Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
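[Editor's note] To make Ted's point about getNextCellHint() concrete, here is a stdlib-only sketch (plain Java, no HBase client; the class and method names are mine) of the seek semantics: within one row's sorted qualifiers, a column-range lookup jumps straight to minColumn and stops once maxColumn is reached, rather than testing every cell one by one:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class ColumnRangeSketch {
    // One row's cells: qualifier -> value, kept sorted the way HBase stores
    // qualifiers within a row (lexicographic byte order; String ordering
    // stands in for that here).
    static NavigableMap<String, String> range(
            NavigableMap<String, String> row, String minColumn, String maxColumn) {
        // subMap seeks directly to minColumn (the moral equivalent of the
        // filter's getNextCellHint) and never looks past maxColumn.
        return row.subMap(minColumn, true, maxColumn, false);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        row.put("a1", "v1");
        row.put("b1", "v2");
        row.put("b2", "v3");
        row.put("c1", "v4");
        // Qualifiers in [b1, c1)
        System.out.println(range(row, "b1", "c1").keySet());
    }
}
```

In the real filter the jump happens server side (the filter returns SEEK_NEXT_USING_HINT), which is why ColumnRangeFilter is much cheaper than a filter that merely evaluates every cell.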
Re: Scan columns of a row within a Range
Hi Esteban,

Yes, I found it moments ago. Is it as efficient as a row scan?

And can I have millions of columns in a row with no or little performance impact? (The traditional tall-vs-wide problem; the HBase manual recommends tall tables over wide tables.)

Jianshi

On Thu, Jul 17, 2014 at 3:01 PM, Esteban Gutierrez wrote:
> Hi Jianshi,
>
> Have you looked into the ColumnRangeFilter?
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html
>
> cheers,
> esteban.
>
> --
> Cloudera, Inc.
>
> On Wed, Jul 16, 2014 at 11:40 PM, Jianshi Huang wrote:
> > Hi,
> >
> > I scanned through HBase's Scan API and couldn't find a way to scan a
> > range of columns within a row.
> >
> > It seems I can only do scan(startRow, endRow), which takes just row keys.
> >
> > What's the most efficient way to do it? Should I use a Filter? I heard
> > filters are not as efficient as row-key scans; how much slower are they?
> >
> > (BTW, I was using Accumulo for the same thing and it has a really nice
> > API (Range, Key) for it. A Key is a combination of RK+CF+CQ+TS.)
> >
> > Am I missing anything?
> >
> > Cheers,
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
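[Editor's note] On the tall-vs-wide question above: a common alternative to keeping millions of columns in one row is to fold the qualifier into the row key, so the wide row becomes many tall rows and the "column range" becomes an ordinary row-key range scan. A stdlib-only sketch of that key design (class names and the separator choice are mine, not HBase API):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class TallTableSketch {
    // Separator that sorts before any printable character; it must never
    // occur inside the original row keys for the ordering to stay correct.
    static final char SEP = '\u0000';

    // Composite row key: original row key + separator + former qualifier.
    static String key(String row, String qualifier) {
        return row + SEP + qualifier;
    }

    // A column range on the old wide row is now a plain key-range scan,
    // modeled here with a sorted map standing in for the table.
    static NavigableMap<String, String> scan(
            NavigableMap<String, String> table, String row, String minQ, String maxQ) {
        return table.subMap(key(row, minQ), true, key(row, maxQ), false);
    }
}
```

Because the composite keys sort first by row then by qualifier, each scan stays within one logical row, and row-key range scans need no filter at all.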
Scan columns of a row within a Range
Hi,

I scanned through HBase's Scan API and couldn't find a way to scan a range of columns within a row.

It seems I can only do scan(startRow, endRow), which takes just row keys.

What's the most efficient way to do it? Should I use a Filter? I heard filters are not as efficient as row-key scans; how much slower are they?

(BTW, I was using Accumulo for the same thing and it has a really nice API (Range, Key) for it. A Key is a combination of RK+CF+CQ+TS.)

Am I missing anything?

Cheers,
--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/