How to set Timeout for get/scan operations without impacting others

2015-05-17 Thread Jianshi Huang
Hi,

I need to set a tight timeout for get/scan operations, and I think the HBase
client already supports this.

I found three related keys:

- hbase.client.operation.timeout
- hbase.rpc.timeout
- hbase.client.retries.number

What's the difference between hbase.client.operation.timeout and
hbase.rpc.timeout?
My understanding is that hbase.rpc.timeout has a larger scope than
hbase.client.operation.timeout, so setting hbase.client.operation.timeout is
safer. Am I correct?

Are there any other property keys I can use?
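
For context, this is how I'm wiring the keys into a dedicated connection so
other workloads keep their defaults (a sketch against the 0.98 client API;
the millisecond values are placeholders, and the scanner key at the end is
one more candidate I'm not sure about):

  Configuration conf = HBaseConfiguration.create();
  // My reading: caps one whole client call, retries included.
  conf.setInt("hbase.client.operation.timeout", 3000);
  // Caps a single RPC to a region server.
  conf.setInt("hbase.rpc.timeout", 1000);
  // Fewer retries so the overall cap is reached quickly.
  conf.setInt("hbase.client.retries.number", 2);
  // Scans may have their own knob:
  conf.setInt("hbase.client.scanner.timeout.period", 3000);
  // A dedicated connection keeps these settings away from other workloads.
  HConnection conn = HConnectionManager.createConnection(conf);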

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Need Help: RegionTooBusyException: Above memstore limit

2015-03-03 Thread Jianshi Huang
The error disappeared after changing the write buffer from 20MB to 2MB.
Thanks for the help!

Jianshi

On Wed, Mar 4, 2015 at 12:12 AM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> It depends on how you manage your connection, your table, and your puts. If
> reducing the batch buffer size works for you, then just keep it the way it
> is...
>
> JM



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Need Help: RegionTooBusyException: Above memstore limit

2015-03-03 Thread Jianshi Huang
Yes, looks like reducing the batch buffer size works (still validating).

But why is setAutoFlush(false) harmful here? I just want maximum write
speed.

Jianshi

On Tue, Mar 3, 2015 at 10:54 PM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> Let HBase manage the flushes for you. Remove edgeTable.setAutoFlush(false)
> and maybe reduce your batch size.
>
> I don't think that increasing the memstore is the right way to go. It
> sounds more like a plaster on the issue than a good fix (to me).
>
> JM
>
> 2015-03-03 9:43 GMT-05:00 Ted Yu :
>
> > The default value for hbase.regionserver.global.memstore.size is 0.4.
> >
> > Meaning the maximum size of all memstores in the region server before new
> > updates are blocked and flushes are forced is 7352m (0.4 of your 18380m
> > heap), which is well above the 774m your problematic region is using.
> >
> > You can increase the value for hbase.regionserver.global.memstore.size.
> >
> > Please also see if you can distribute the writes across the underlying
> > regions so that the region's use of memstore comes down.
> >
> > Cheers



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Need Help: RegionTooBusyException: Above memstore limit

2015-03-03 Thread Jianshi Huang
Hi JM,

Thanks for the hints. Here are my settings for the writer:

edgeTable.setAutoFlush(false)          // batch puts on the client side
edgeTable.setWriteBufferSize(20971520) // 20 MB write buffer

The write buffer seems quite large, as the region server is hosting 12 of
the regions I'm writing to. I'll test with a smaller write buffer size.

The size of each put is between 10k and 100k.
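
Concretely, this is what I'll test next (a sketch; edgeTable is the same
HTable as above, and 2 MB is just the first size I'll try):

  edgeTable.setAutoFlush(false);                 // keep client-side batching
  edgeTable.setWriteBufferSize(2 * 1024 * 1024); // 2 MB instead of 20 MB
  // ... batched puts go here ...
  edgeTable.flushCommits();                      // flush any remaining puts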

Jianshi

On Mon, Mar 2, 2015 at 11:04 PM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> Hi Jianshi,
>
> Are you doing batches of puts? If so, what's the size of the batch and
> what's the size of the puts? Are you sending a batch which will end up
> bigger than the memstore size for a single RS? Can you try to reduce the
> size of this batch?
>
> JM



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Need Help: RegionTooBusyException: Above memstore limit

2015-03-03 Thread Jianshi Huang
Hi Ted,

Only one region server is problematic.

hbase.regionserver.global.memstore.size is not set; the problematic region
is using 774m for its memstore.

Max heap is 18380m for all region servers.

Jianshi


On Mon, Mar 2, 2015 at 10:59 PM, Ted Yu  wrote:

> What's the value for hbase.regionserver.global.memstore.size ?
>
> Did RegionTooBusyException happen to many regions or only a few regions ?
>
> How much heap did you give region servers ?
>
> bq. HBase version is 0.98.0.2.1.2.0-402
>
> Yeah, this is a bit old. Please consider upgrading.
>
> Cheers



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Need Help: RegionTooBusyException: Above memstore limit

2015-03-02 Thread Jianshi Huang
Hi,

I'm constantly facing "RegionTooBusyException: Above memstore limit" errors
on one region server when writing data to HBase.

I checked the region server log, and I've seen a lot of warnings during the
data writes:

  WARN wal.FSHLog couldn't find oldest seqNum for the region we're about to
flush, ...

Then HBase seems to flush the data and add it as an HStore file.

I also get a few warnings in client.ShortCircuitCache saying "could not load
... due to InvalidToken exceptions".

Can anyone give me a hint about what went wrong?

My HBase version is 0.98.0.2.1.2.0-402 (I'm using HDP 2.1), so the release
is a little bit old.

Thanks,

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: setCompactionEnabled(false) seems ignored by HBase (0.98)

2015-01-06 Thread Jianshi Huang
Thanks Stack,

Will check the logs for the reason. I'm only disabling compaction during
dynamic splits (~10 mins), so it's acceptable in my case.

Thanks,
Jianshi

On Wed, Jan 7, 2015 at 1:37 AM, Stack  wrote:

> On Mon, Jan 5, 2015 at 11:00 PM, Jianshi Huang 
> wrote:
>
> > Hi,
> >
> > Firstly, I found it strange that when I added a new split to a table and
> do
> > admin.move, it will trigger a MAJOR compaction for the whole table.
> >
>
> Usually, a compaction says in the log what provoked it and why it is a
> major compaction.
>
> Splits and moves are not hooked up to force a major compaction, so check
> the logs to see what brought on the compaction. Rather than wholesale
> disabling compactions -- probably a bad idea -- you are probably better off
> trying to tune what triggers compactions in your workload.
>
> St.Ack
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: setCompactionEnabled(false) seems ignored by HBase (0.98)

2015-01-06 Thread Jianshi Huang
Ah! I need to run

  admin.modifyTable(tableNameBytes, tableDescriptor)

Will try it soon...
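
Roughly like this, I assume (a sketch against the 0.98 HBaseAdmin API;
setCompactionEnabled() alone only changes the local descriptor copy, so the
modified descriptor has to be pushed back through the master):

  HTableDescriptor desc = admin.getTableDescriptor(tableNameBytes);
  desc.setCompactionEnabled(false);
  admin.modifyTable(tableNameBytes, desc);  // apply the flag server-side
  // ... add splits and move regions here ...
  desc.setCompactionEnabled(true);          // re-enable when done
  admin.modifyTable(tableNameBytes, desc);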

Jianshi

On Tue, Jan 6, 2015 at 11:12 PM, Ted Yu  wrote:

> This is what setCompactionEnabled() does:
>
>   public HTableDescriptor setCompactionEnabled(final boolean isEnable) {
>     setValue(COMPACTION_ENABLED_KEY, isEnable ? TRUE : FALSE);
>     return this;
>   }
>
> FYI



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


setCompactionEnabled(false) seems ignored by HBase (0.98)

2015-01-05 Thread Jianshi Huang
Hi,

Firstly, I found it strange that when I add a new split to a table and do
admin.move, it triggers a MAJOR compaction for the whole table.

So I tried to disable compaction before adding splits:

admin.getTableDescriptor(tableNameBytes).setCompactionEnabled(false)

However, MAJOR compaction is still triggered; it looks like the flag is
ignored by HBase. Do I need to (have to) disable the table first?

Cheers,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Storing JSON in HBase value cell, which serialization format is most compact?

2014-11-13 Thread Jianshi Huang
Oh, that article, I've read it before. I'm using the approach of holding
all my columns in a single KV (mostly read-only).

So conclusion: the saving in disk space is not that huge.

one HBase column per source column: 1,350,483 (1000 records, SNAPPY, DIFF)
vs one HBase column for all columns: 1,119,330 (1000 records, SNAPPY, DIFF)

Only about 15%.

However, the article suggested that the saving over the network wire is
huge:

one HBase column per source column: 6,293,670 (1000 records, NONE, NONE)
vs one HBase column for all columns: 1,374,465 (1000 records, NONE, NONE)


Thanks again for the help!

Jianshi

On Fri, Nov 14, 2014 at 12:12 PM, Ted Yu  wrote:

> w.r.t. the effect of data block encoding on HFile size, take a look at Doug
> Meil's blog 'The Effect of ColumnFamily, RowKey and KeyValue Design on
> HFile Size':
> http://blogs.apache.org/hbase/
>
> Cheers

Re: Storing JSON in HBase value cell, which serialization format is most compact?

2014-11-13 Thread Jianshi Huang
But HDP 2.2 uses HDFS 2.6.0... it will be very hard to convince our admins
to upgrade.

Would you recommend that we upgrade to 2.6.0? I'll ask them to consult HWX
if you say yes. :)

Jianshi

On Fri, Nov 14, 2014 at 9:42 AM, Ted Yu  wrote:

> No.
> The upcoming HDP 2.2 does have that fix.
>
> Cheers

Re: Storing JSON in HBase value cell, which serialization format is most compact?

2014-11-13 Thread Jianshi Huang
Oh, btw, does the latest HDP 2.1 (0.98.0.2.1.7.0-784-hadoop2) have this fix?

Jianshi

On Fri, Nov 14, 2014 at 9:37 AM, Jianshi Huang 
wrote:

> Thanks Ted.
>
> I think the fix you mentioned is this one HBASE-12078
> <https://issues.apache.org/jira/browse/HBASE-12078>.
>
> Not sure when our Hadoop admin would upgrade it, ahhh
>
> Jianshi

Re: Storing JSON in HBase value cell, which serialization format is most compact?

2014-11-13 Thread Jianshi Huang
Thanks Ted.

I think the fix you mentioned is this one HBASE-12078
<https://issues.apache.org/jira/browse/HBASE-12078>.

Not sure when our Hadoop admin would upgrade it, ahhh

Jianshi

On Thu, Nov 13, 2014 at 11:15 PM, Ted Yu  wrote:

> Keep in mind that Prefix Tree encoding has higher overhead in the write
> path compared to other data block encoding methods.
>
> Please use 0.98.7, which has the latest fixes for Prefix Tree encoding.
>
> Cheers



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Storing JSON in HBase value cell, which serialization format is most compact?

2014-11-13 Thread Jianshi Huang
Thanks Ram,

How about Prefix Tree based encoding then? HBASE-4676
<https://issues.apache.org/jira/browse/HBASE-4676> says it's also possible
to do suffix tries. Then it could be a nice fit for JSON strings (or any
long value where changes are small).

Maybe I should just flatten JSON to columns, hmm... what's the overhead for
a column?
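
For the record, flattening would look roughly like this on the write path (a
sketch; the family name "cf", jsonMap, and rowKey are made-up names). The
per-column overhead is one full KeyValue per cell, i.e. the row key, family,
qualifier, and timestamp are repeated for every column:

  Put put = new Put(rowKey);
  for (Map.Entry<String, String> field : jsonMap.entrySet()) {
    put.add(Bytes.toBytes("cf"),              // column family (made up)
            Bytes.toBytes(field.getKey()),    // JSON field name as qualifier
            Bytes.toBytes(field.getValue())); // JSON field value
  }
  table.put(put);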

Jianshi

On Thu, Nov 13, 2014 at 4:49 PM, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> wrote:

> >>So is it possible to specify FASTDIFF for rowkey/column and DIFF for
> value
> cell?
> No that is not possible now. All the encoding is per KV only.
> But what you say is definitely worth trying.
>
> >>So would you recommend storing JSON flattened as many columns?
> Maybe yes. But I have practically not used JSON formats, so I may not be
> the best person to comment on this.
>
> Regards
> Ram



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Storing JSON in HBase value cell, which serialization format is most compact?

2014-11-13 Thread Jianshi Huang
Thanks Ram,

So is it possible to specify FASTDIFF for rowkey/column and DIFF for value
cell?

So would you recommend storing JSON flattened as many columns?

Jianshi

On Thu, Nov 13, 2014 at 2:08 PM, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> wrote:

> Hi
>
> >> Since I'm storing
> historical data (snapshot data) and changes between adjacent value cells
> are relatively small.
>
> If the values are changing, even if the change is small, FASTDIFF will
> rewrite the value part. Only if there are exact matches will it skip the
> value part. JFYI.
>
> Regards
> Ram



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Storing JSON in HBase value cell, which serialization format is most compact?

2014-11-12 Thread Jianshi Huang
I thought FASTDIFF was only for the rowkey and columns; great if it also
works on the value cell.

And thanks for the bjson link!
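
In case anyone finds this later: enabling it is a column-family setting (a
sketch against the 0.98 API; the table and family names are made up):

  HColumnDescriptor family = new HColumnDescriptor("v");
  family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
  family.setCompressionType(Compression.Algorithm.SNAPPY); // block compression on top
  HTableDescriptor table = new HTableDescriptor(TableName.valueOf("json_snapshots"));
  table.addFamily(family);
  admin.createTable(table);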

Jianshi

On Thu, Nov 13, 2014 at 1:18 PM, Ted Yu  wrote:

> There is FASTDIFF data block encoding.
>
> See also http://bjson.org/
>
> Cheers



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Storing JSON in HBase value cell, which serialization format is most compact?

2014-11-12 Thread Jianshi Huang
Hi,

I'm currently saving JSON in pure String format in the value cell and
depend on HBase's block compression to reduce the overhead of JSON.

I'm wondering if there's a more space-efficient way to store JSON?
(there're lots of 0s and 1s, so a JSON String is actually an OK format)

I want to keep the value as a Map since the schema of the source data might
change over time.

Also, is there a DIFF-based encoding for values? Since I'm storing
historical data (snapshot data), changes between adjacent value cells
are relatively small.


Thanks,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Is there a TableInputFormat implementation that supports multiple splits for each region

2014-10-15 Thread Jianshi Huang
It seems each region becomes exactly one split in the current
TableInputFormat. We have large regions, so this is suboptimal.

Is there a TableInputFormat implementation that supports multiple splits
for each region?
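
One option I'm considering is subclassing TableInputFormat and cutting each
region's range in two inside getSplits() (an untested sketch against the
0.98 mapreduce API; regions with an empty start or end key are left as
single splits to keep the midpoint math simple):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
  import org.apache.hadoop.hbase.mapreduce.TableSplit;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;

  public class MultiSplitTableInputFormat extends TableInputFormat {
    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
      List<InputSplit> result = new ArrayList<InputSplit>();
      for (InputSplit split : super.getSplits(context)) {
        TableSplit ts = (TableSplit) split;
        byte[] start = ts.getStartRow();
        byte[] end = ts.getEndRow();
        // Bytes.split() gives {start, mid, end}; it can return null when
        // the range cannot be split, in which case we keep the region whole.
        byte[][] keys = (start.length == 0 || end.length == 0)
            ? null : Bytes.split(start, end, 1);
        if (keys == null || keys.length < 3) {
          result.add(ts);
        } else {
          result.add(new TableSplit(ts.getTable(), start, keys[1], ts.getRegionLocation()));
          result.add(new TableSplit(ts.getTable(), keys[1], end, ts.getRegionLocation()));
        }
      }
      return result;
    }
  }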


Thanks,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-22 Thread Jianshi Huang
Thanks for the explanation, Qian!

I think being able to balance empty regions is important and the preferred
outcome for me.

The best way so far is to manually 'balance' the regions if we need to add
pre-splits dynamically.


Jianshi



On Tue, Sep 23, 2014 at 11:35 AM, Qiang Tian  wrote:

> Hello, I happened to hit balancer-related issues 2 months ago and looked at
> that part; below is a summary:
> 1) By default, the hbase balancer (StochasticLoadBalancer by default) does
> not balance regions per table, i.e. all regions are considered as one
> table. So if you have many tables, especially if some tables have empty
> regions, you probably get unbalanced, and the balancer is probably not
> triggered at all.
> This came from code inspection; my problem failed to reproduce later,
> but it proved that deleting empty regions can trigger the balancer
> correctly and make regions well balanced.
>
> 2) There are some other reasons that the balancer is not triggered; see
> HMaster#balance. Turning on debug shows the related messages in the master
> log. In my case, it was not triggered because there were regions in
> transition:
> LOG.debug("Not running balancer because " + regionsInTransition.size() +
>     " region(s) in transition: " +
>     org.apache.commons.lang.StringUtils.abbreviate(regionsInTransition.toString(), 256));
>
> The cause can be found in the regionserver log file.
>
> 3) Per-table balancing can be set with "hbase.master.loadbalance.bytable";
> however, it looks like a poor option when you have many tables - the master
> will issue a balance call for each table, one by one.
>
> 4) Split regions follow the normal balancer process, so if you have the
> issue in #1, splitting does not help balance. Pre-splitting at table
> creation looks fine, though, since it uses round-robin assignment.
>
>
>
> On Tue, Sep 23, 2014 at 2:12 AM, Bharath Vissapragada <
> bhara...@cloudera.com
> > wrote:
>
> > https://issues.apache.org/jira/browse/HBASE-11368 is related to the
> > original issue too.
> >
> > On Mon, Sep 22, 2014 at 10:18 AM, Ted Yu  wrote:
> >
> > > As you noted in the FIXME, there're some factors which should be
> > > tackled by the balancer / assignment manager.
> > >
> > > Please continue digging up master log so that we can find the cause for
> > > balancer not fulfilling your goal.
> > >
> > > Cheers

Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-22 Thread Jianshi Huang
Ok, I fixed this by manually reassigning the regions to region servers myself.

  def reassignRegionServer(admin: HBaseAdmin, regions: Seq[HRegionInfo],
                           regionServers: Seq[ServerName]): Unit = {
    val rand = new Random()
    regions.foreach { r =>
      val idx = rand.nextInt(regionServers.size)
      val server = regionServers(idx)
      // FIXME: what if selected region server is dead?
      admin.move(r.getEncodedNameAsBytes,
                 server.getServerName.getBytes("UTF8"))
    }
  }

er...

Jianshi


Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-22 Thread Jianshi Huang
Hmm...any workaround? I only want to do this:

Rebalance the new regions *evenly* across all servers after manually adding
splits, so later bulk insertions won't cause contention.

P.S.
Looks like two of the region servers which had the majority of the regions
went down during major compaction... I guess they had too much data.
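
For reference, the sequence I've been issuing (a minimal sketch; balancer()
returns false when the run is skipped, e.g. while regions are in transition):

  // Enable the balancer first; balancer() is a no-op while it's switched off.
  admin.setBalancerRunning(true, true); // on, synchronous
  boolean ran = admin.balancer();       // false if the balancer didn't run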


Jianshi


Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-22 Thread Jianshi Huang
Yes, I have access to the Master UI; however, logs/*.log cannot be opened or
downloaded. There must be some security restrictions in the proxy...

Jianshi

On Tue, Sep 23, 2014 at 12:06 AM, Ted Yu  wrote:

> Do you have access to the Master UI?
>
> :60010/logs/ would show you the list of log files.
>
> Then you can view :60010/logs/hbase--master-XXX.log
>
> Cheers

Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-22 Thread Jianshi Huang
Ah... I don't have access to HMaster logs... I need to ask the admin.

Jianshi

On Mon, Sep 22, 2014 at 11:49 PM, Ted Yu  wrote:

> bq. assign per-table balancer class
>
> Not that I know of.
> Can you pastebin the master log output from the balancer?
>
> Cheers
>
> On Mon, Sep 22, 2014 at 8:29 AM, Jianshi Huang 
> wrote:
>
> > Hi Ted,
> >
> > I moved setBalancerRunning before balancer and ran them twice. However, I
> > still got a highly skewed region distribution.
> >
> > I guess it's because of the StochasticLoadBalancer; can I assign a
> > per-table balancer class in HBase?
> >
> >
> > Jianshi
> >
> > On Mon, Sep 22, 2014 at 9:50 PM, Ted Yu  wrote:
> >
> > > The admin.setBalancerRunning() call should precede the call to
> > > admin.balancer().
> > >
> > > You can inspect the master log to see whether regions are being moved
> > > off the heavily loaded server.
> > >
> > > Cheers
> > >
> > > On Mon, Sep 22, 2014 at 1:42 AM, Jianshi Huang <
> jianshi.hu...@gmail.com>
> > > wrote:
> > >
> > > > Hi Ted and others,
> > > >
> > > > I did the following after adding splits (without data) to my table;
> > > > however, the regions are still very imbalanced (one region server has
> > > > 221 regions and the other 50 region servers have about 4~8 regions
> > > > each).
> > > >
> > > >   admin.balancer()
> > > >   admin.setBalancerRunning(true, true)
> > > >
> > > > The balancer class in my HBase cluster is
> > > >
> > > > org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer
> > > >
> > > > So, is this behavior expected? Can I assign a different balancer class
> > > > to my tables (I don't have HBase admin permission)? Which one should I
> > > > use?
> > > >
> > > > I just want HBase to distribute the regions evenly even when they
> > > > don't hold data (that's the purpose of pre-splitting, I think).
> > > >
> > > >
> > > > Jianshi
> > > >
> > > >
> > > > On Sat, Sep 6, 2014 at 12:45 AM, Ted Yu  wrote:
> > > >
> > > > > Yes. See the following method in HBaseAdmin:
> > > > >
> > > > >   public boolean balancer()
> > > > >
> > > > >
> > > > > On Fri, Sep 5, 2014 at 9:38 AM, Jianshi Huang <
> > jianshi.hu...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Thanks Ted!
> > > > > >
> > > > > > Didn't know I still need to run the 'balancer' command.
> > > > > >
> > > > > > Is there a way to do it programmatically?
> > > > > >
> > > > > > Jianshi
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, Sep 6, 2014 at 12:29 AM, Ted Yu 
> > wrote:
> > > > > >
> > > > > > > After splitting the region, you may need to run balancer to
> > spread
> > > > the
> > > > > > new
> > > > > > > regions out.
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Sep 5, 2014 at 9:25 AM, Jianshi Huang <
> > > > jianshi.hu...@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Shahab,
> > > > > > > >
> > > > > > > > I see, that seems to be the right way...
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sat, Sep 6, 2014 at 12:21 AM, Shahab Yunus <
> > > > > shahab.yu...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Shahab
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Jianshi Huang
> > > > > > > >
> > > > > > > > LinkedIn: jianshi
> > > > > > > > Twitter: @jshuang
> > > > > > > > Github & Blog: http://huangjs.github.com/
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jianshi Huang
> > > > > >
> > > > > > LinkedIn: jianshi
> > > > > > Twitter: @jshuang
> > > > > > Github & Blog: http://huangjs.github.com/
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Jianshi Huang
> > > >
> > > > LinkedIn: jianshi
> > > > Twitter: @jshuang
> > > > Github & Blog: http://huangjs.github.com/
> > > >
> > >
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-22 Thread Jianshi Huang
Hi Ted,

I moved setBalancerRunning before balancer and ran them twice. However, I
still got a highly skewed region distribution.

I guess it's because of the StochasticLoadBalancer. Can I assign a per-table
balancer class in HBase?


Jianshi

On Mon, Sep 22, 2014 at 9:50 PM, Ted Yu  wrote:

> admin.setBalancerRunning() call should precede the call to
> admin.balancer().
>
> You can inspect master log to see whether regions are being moved off the
> heavily loaded server.
>
> Cheers
>
> On Mon, Sep 22, 2014 at 1:42 AM, Jianshi Huang 
> wrote:
>
> > Hi Ted and others,
> >
> > I did the following after adding splits (without data) to my table,
> however
> > the region is still very imbalanced (one region server has 221 regions
> and
> > other 50 region servers have about 4~8 regions each).
> >
> >   admin.balancer()
> >   admin.setBalancerRunning(true, true)
> >
> > The balancer class in my HBase cluster is
> >
> > org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer
> >
> > So, is this behavior expected? Can I assign different balancer class to
> my
> > tables (I don't have HBase admin permission)? Which one should I use?
> >
> > I just want HBase to evenly distribute the regions even they don't have
> > data (that's the purpose of pre-split I think).
> >
> >
> > Jianshi
> >
> >
> > On Sat, Sep 6, 2014 at 12:45 AM, Ted Yu  wrote:
> >
> > > Yes. See the following method in HBaseAdmin:
> > >
> > >   public boolean balancer()
> > >
> > >
> > > On Fri, Sep 5, 2014 at 9:38 AM, Jianshi Huang  >
> > > wrote:
> > >
> > > > Thanks Ted!
> > > >
> > > > Didn't know I still need to run the 'balancer' command.
> > > >
> > > > Is there a way to do it programmatically?
> > > >
> > > > Jianshi
> > > >
> > > >
> > > >
> > > > On Sat, Sep 6, 2014 at 12:29 AM, Ted Yu  wrote:
> > > >
> > > > > After splitting the region, you may need to run balancer to spread
> > the
> > > > new
> > > > > regions out.
> > > > >
> > > > > Cheers
> > > > >
> > > > >
> > > > > On Fri, Sep 5, 2014 at 9:25 AM, Jianshi Huang <
> > jianshi.hu...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Shahab,
> > > > > >
> > > > > > I see, that seems to be the right way...
> > > > > >
> > > > > >
> > > > > > On Sat, Sep 6, 2014 at 12:21 AM, Shahab Yunus <
> > > shahab.yu...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Shahab
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jianshi Huang
> > > > > >
> > > > > > LinkedIn: jianshi
> > > > > > Twitter: @jshuang
> > > > > > Github & Blog: http://huangjs.github.com/
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Jianshi Huang
> > > >
> > > > LinkedIn: jianshi
> > > > Twitter: @jshuang
> > > > Github & Blog: http://huangjs.github.com/
> > > >
> > >
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-22 Thread Jianshi Huang
Hi Ted and others,

I did the following after adding splits (without data) to my table; however,
the regions are still very imbalanced (one region server has 221 regions and
the other 50 region servers have about 4~8 regions each).

  admin.balancer()
  admin.setBalancerRunning(true, true)
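
Per Ted's replies elsewhere in this thread, setBalancerRunning() should
precede balancer(); a minimal sketch of that ordering, assuming an open
HBaseAdmin named admin:

  // Enable the balancer first, then ask the master to run one pass.
  admin.setBalancerRunning(true, true)  // on = true, synchronous = true
  val ran: Boolean = admin.balancer()   // true if a balancing pass was run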

The balancer class in my HBase cluster is

org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer

So, is this behavior expected? Can I assign a different balancer class to my
tables (I don't have HBase admin permission)? Which one should I use?

I just want HBase to distribute the regions evenly even if they don't have
data (that's the purpose of pre-splitting, I think).


Jianshi


On Sat, Sep 6, 2014 at 12:45 AM, Ted Yu  wrote:

> Yes. See the following method in HBaseAdmin:
>
>   public boolean balancer()
>
>
> On Fri, Sep 5, 2014 at 9:38 AM, Jianshi Huang 
> wrote:
>
> > Thanks Ted!
> >
> > Didn't know I still need to run the 'balancer' command.
> >
> > Is there a way to do it programmatically?
> >
> > Jianshi
> >
> >
> >
> > On Sat, Sep 6, 2014 at 12:29 AM, Ted Yu  wrote:
> >
> > > After splitting the region, you may need to run balancer to spread the
> > new
> > > regions out.
> > >
> > > Cheers
> > >
> > >
> > > On Fri, Sep 5, 2014 at 9:25 AM, Jianshi Huang  >
> > > wrote:
> > >
> > > > Hi Shahab,
> > > >
> > > > I see, that seems to be the right way...
> > > >
> > > >
> > > > On Sat, Sep 6, 2014 at 12:21 AM, Shahab Yunus <
> shahab.yu...@gmail.com>
> > > > wrote:
> > > >
> > > > > Shahab
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Jianshi Huang
> > > >
> > > > LinkedIn: jianshi
> > > > Twitter: @jshuang
> > > > Github & Blog: http://huangjs.github.com/
> > > >
> > >
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?

2014-09-17 Thread Jianshi Huang
Thanks Esteban for the suggestion.

For case 2), KeyPrefixRegionSplitPolicy won't be enough, I think, as we're
constantly adding new types, so the number of types is unknown at the
beginning; when there's a new type of data, we add pre-splits [type|00,
type|01, ..., type|FF] to the table. Data is ingested one type after another,
so if there are no auto-splits, ingestion will be too slow.

For case 1), I thought about binning, but it makes scans in TableInputFormat
more complicated. I think automatic pre-splits can solve it, so currently a
sampling process is run to compute the splitKeys for each batch of timeseries
data to be ingested.
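
For illustration, one way such a sampling step could compute the split keys
(a hypothetical helper, not the code actually used here; it assumes the
sample is larger than the target region count, and that rowkeys are byte
arrays ordered lexicographically, as HBase orders them):

  import org.apache.hadoop.hbase.util.Bytes

  def computeSplitKeys(sample: Seq[Array[Byte]], numRegions: Int): Seq[Array[Byte]] = {
    // Sort the sampled rowkeys the way HBase sorts them.
    val sorted = sample.sortWith((a, b) => Bytes.compareTo(a, b) < 0)
    // Take evenly spaced quantiles as the numRegions - 1 split points.
    val step = sorted.size.toDouble / numRegions
    (1 until numRegions).map(i => sorted((i * step).toInt))
  }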

Jianshi


On Thu, Sep 18, 2014 at 3:19 AM, Esteban Gutierrez 
wrote:

> Thanks Jianshi for that helpful information,
>
> I think for use case 1) it depends on the data ingestion rate when the
> regions need to split. The synchronous split operation makes some sense
> there  if you want the regions to contain specific time ranges and/or
> number of records.
>
> For use case 2) I think is a good match for the KeyPrefixRegionSplitPolicy
> or DelimitedKeyPrefixRegionSplitPolicy. Since the regions will be split
> based on the  if type length is fixed or if the type is of varying
> length but delimited with |
>
> On a second thought, it might be even possible to solve 1) with those
> prefix based split policies if you use a prefix for your key that also
> varies monotonically or can be passed by the client when it has reached
> some threshold, e.g. after writing X billion data points, use prefix 001
> and next Y billion data rows use prefix 002 or something like that.
>
> cheers,
> esteban.
>
>
> --
> Cloudera, Inc.
>
>
> On Wed, Sep 17, 2014 at 11:53 AM, Jianshi Huang 
> wrote:
>
> > Hi Esteban,
> >
> > Two reasons to split dynamically,
> >
> > 1) I have a column family that stores timeseries data for mapreduce
> tasks,
> > and the rowkey is monotonically increasing to make scanning easier.
> >
> > 2) (a better reason), I'm storing multiple types of data in the same
> table,
> > and I have about 500TB of data in total. That's many billions of rows and
> > many thousands of regions. I want to make sure ingesting one type of data
> > won't touch every region which will cause a lot of fragments and merge
> > operations, the rowkey is designed as ||.
> >
> > So either way I would want a dynamic split in my design.
> >
> > Jianshi
> >
> >
> > On Thu, Sep 18, 2014 at 2:39 AM, Esteban Gutierrez  >
> > wrote:
> >
> > > Jianshi,
> > >
> > > The retry is not an expected behavior that the client should be doing.
> In
> > > fact you don't want your clients to issue admin operations to the
> cluster
> > > ;)
> > >
> > > Shahab's option is the best alternative by polling when the number of
> > > regions has changed in the table you want to modify the splits
> > dynamically.
> > > The JIRA that Ted suggested requires modification in the core table
> > > operations to support sync operations and requires some major work to
> do
> > it
> > > right. Ted's alternative to create the splits at table creation time is
> > the
> > > best option if you can pre-split IMHO.
> > >
> > > If you could elaborate more on the practical reasons you mention to
> > create
> > > synchronously those new regions that would be great for us. Maybe its
> > > related to multi-tenancy but I'm just guessing :)
> > >
> > > esteban.
> > >
> > >
> > > --
> > > Cloudera, Inc.
> > >
> > >
> > > On Wed, Sep 17, 2014 at 11:09 AM, Ted Yu  wrote:
> > >
> > > > Jianshi:
> > > > See HBASE-11608 Add synchronous split
> > > >
> > > > bq. createTable does something special?
> > > >
> > > > Yes. See this in HBaseAdmin:
> > > >
> > > >   public void createTable(final HTableDescriptor desc, byte [][]
> > > splitKeys)
> > > >
> > > > On Wed, Sep 17, 2014 at 10:58 AM, Jianshi Huang <
> > jianshi.hu...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > I see Shahab, async makes sense, but I prefer that the HBase client
> > > does
> > > > > the retry for me, and let me specify a timeout parameter.
> > > > >
> > > > > One question, does that mean adding multiple splits into one region
> > has
> > > > to
> > > > > be done sequentially? How can I add region splits

Re: Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?

2014-09-17 Thread Jianshi Huang
Hi Esteban,

Two reasons to split dynamically,

1) I have a column family that stores timeseries data for mapreduce tasks,
and the rowkey is monotonically increasing to make scanning easier.

2) (a better reason) I'm storing multiple types of data in the same table,
and I have about 500TB of data in total. That's many billions of rows and
many thousands of regions. I want to make sure ingesting one type of data
won't touch every region, which would cause a lot of fragmentation and merge
operations; the rowkey is designed as ||.

So either way I would want a dynamic split in my design.
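
For completeness, the create-time pre-split that Esteban mentions below looks
roughly like this (a minimal sketch; the table and family names are
hypothetical, and splitKeys is a pre-computed Array[Array[Byte]] of split
points):

  import org.apache.hadoop.hbase.{HColumnDescriptor, HTableDescriptor, TableName}

  val desc = new HTableDescriptor(TableName.valueOf("mytable"))
  desc.addFamily(new HColumnDescriptor("d"))  // placeholder family
  admin.createTable(desc, splitKeys)          // regions are created pre-split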

Jianshi


On Thu, Sep 18, 2014 at 2:39 AM, Esteban Gutierrez 
wrote:

> Jianshi,
>
> The retry is not an expected behavior that the client should be doing. In
> fact you don't want your clients to issue admin operations to the cluster
> ;)
>
> Shahab's option is the best alternative by polling when the number of
> regions has changed in the table you want to modify the splits dynamically.
> The JIRA that Ted suggested requires modification in the core table
> operations to support sync operations and requires some major work to do it
> right. Ted's alternative to create the splits at table creation time is the
> best option if you can pre-split IMHO.
>
> If you could elaborate more on the practical reasons you mention to create
> synchronously those new regions that would be great for us. Maybe its
> related to multi-tenancy but I'm just guessing :)
>
> esteban.
>
>
> --
> Cloudera, Inc.
>
>
> On Wed, Sep 17, 2014 at 11:09 AM, Ted Yu  wrote:
>
> > Jianshi:
> > See HBASE-11608 Add synchronous split
> >
> > bq. createTable does something special?
> >
> > Yes. See this in HBaseAdmin:
> >
> >   public void createTable(final HTableDescriptor desc, byte [][]
> splitKeys)
> >
> > On Wed, Sep 17, 2014 at 10:58 AM, Jianshi Huang  >
> > wrote:
> >
> > > I see Shahab, async makes sense, but I prefer that the HBase client
> does
> > > the retry for me, and let me specify a timeout parameter.
> > >
> > > One question, does that mean adding multiple splits into one region has
> > to
> > > be done sequentially? How can I add region splits in parallel? Does
> > > createTable does something special?
> > >
> > >
> > > Jianshi
> > >
> > >
> > > On Wed, Sep 17, 2014 at 8:06 PM, Shahab Yunus 
> > > wrote:
> > >
> > > > Split is an async operation. When you call it, and the call returns,
> it
> > > > does not mean that the region has been created yet.
> > > >
> > > > So either you wait for a while (using Thread.sleep) or check for the
> > > number
> > > > of regions in a loop and until they have increased to the value you
> > want
> > > > and then access the region. The former is not a good idea, though you
> > can
> > > > try it out just to make sure that this is indeed the issue.
> > > >
> > > > What am I suggesting is something like (pseudo code):
> > > >
> > > > while(new#regions <= old#regions)
> > > > {
> > > >new#regions = admin.getLatest#regions
> > > > }
> > > >
> > > > Regards,
> > > > Shahab
> > > >
> > > > On Wed, Sep 17, 2014 at 5:39 AM, Jianshi Huang <
> > jianshi.hu...@gmail.com>
> > > > wrote:
> > > >
> > > > > I constantly get the following errors when I tried to add splits
> to a
> > > > > table.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException):
> > > > > org.apache.hadoop.hbase.NotServingRegionException: Region
> > > > >
> > > >
> > >
> >
> grapple_vertices,cust|rval#7eb7cffca280|1636500018299676757,1410945568
> > > > > 484.e7743495366df3c82a8571b36c2bdac3. is not online on
> > > > > lvshdc5dn0193.lvs.paypal.com,60020,1405014719359
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
> > >

Re: Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?

2014-09-17 Thread Jianshi Huang
You rock, Ted. I would add a synchronous addSplits as well; there's no good
reason multiple splits have to be done sequentially.

I also checked createTable, and I traced the code to here and lost track...

executeCallable(new MasterCallable<Void>(getConnection()) {
  @Override
  public Void call() throws ServiceException {
    CreateTableRequest request =
        RequestConverter.buildCreateTableRequest(desc, splitKeys);
    master.createTable(null, request);
    return null;
  }
});

So what happens in the handler of the CreateTableRequest? Which part of the
code should I check?

Jianshi


On Thu, Sep 18, 2014 at 2:09 AM, Ted Yu  wrote:

> Jianshi:
> See HBASE-11608 Add synchronous split
>
> bq. createTable does something special?
>
> Yes. See this in HBaseAdmin:
>
>   public void createTable(final HTableDescriptor desc, byte [][] splitKeys)
>
> On Wed, Sep 17, 2014 at 10:58 AM, Jianshi Huang 
> wrote:
>
> > I see Shahab, async makes sense, but I prefer that the HBase client does
> > the retry for me, and let me specify a timeout parameter.
> >
> > One question, does that mean adding multiple splits into one region has
> to
> > be done sequentially? How can I add region splits in parallel? Does
> > createTable does something special?
> >
> >
> > Jianshi
> >
> >
> > On Wed, Sep 17, 2014 at 8:06 PM, Shahab Yunus 
> > wrote:
> >
> > > Split is an async operation. When you call it, and the call returns, it
> > > does not mean that the region has been created yet.
> > >
> > > So either you wait for a while (using Thread.sleep) or check for the
> > number
> > > of regions in a loop and until they have increased to the value you
> want
> > > and then access the region. The former is not a good idea, though you
> can
> > > try it out just to make sure that this is indeed the issue.
> > >
> > > What am I suggesting is something like (pseudo code):
> > >
> > > while(new#regions <= old#regions)
> > > {
> > >new#regions = admin.getLatest#regions
> > > }
> > >
> > > Regards,
> > > Shahab
> > >
> > > On Wed, Sep 17, 2014 at 5:39 AM, Jianshi Huang <
> jianshi.hu...@gmail.com>
> > > wrote:
> > >
> > > > I constantly get the following errors when I tried to add splits to a
> > > > table.
> > > >
> > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException):
> > > > org.apache.hadoop.hbase.NotServingRegionException: Region
> > > >
> > >
> >
> grapple_vertices,cust|rval#7eb7cffca280|1636500018299676757,1410945568
> > > > 484.e7743495366df3c82a8571b36c2bdac3. is not online on
> > > > lvshdc5dn0193.lvs.paypal.com,60020,1405014719359
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.splitRegion(HRegionServer.java:3818)
> > > > at
> > > >
> > > >
> > > > But when I checked the region server (from hbase' webUI), the region
> is
> > > > actually listed there.
> > > >
> > > > What does the error mean actually? How can I solve it?
> > > >
> > > > Currently I'm adding splits single-threaded, and I want to make it
> > > > parallel, is there anything I need to be careful about?
> > > >
> > > > Here's the code for adding splits:
> > > >
> > > >   def addSplits(tableName: String, splitKeys: Seq[Array[Byte]]): Unit
> > = {
> > > > val admin = new HBaseAdmin(conn)
> > > >
> > > > try {
> > > >   val regions = admin.getTableRegions(tableName.getBytes("UTF8"))
> > > >   val regionStartKeys = regions.map(_.getStartKey)
> > > >   val splits = splitKeys.diff(regionStartKeys)
> > > >
> > > >   splits.foreach { splitPoint =>
> > > > admin.split(tableName.getBytes("UTF8"), splitPoint)
> > > >   }
> > > >   // NOTE: important!
> > > >   admin.balancer()
> > > > }
> > > > finally {
> > > >   admin.close()
> > > > }
> > > >   }
> > > >
> > > >
> > > > Any help is appreciated.
> > > >
> > > > --
> > > > Jianshi Huang
> > > >
> > > > LinkedIn: jianshi
> > > > Twitter: @jshuang
> > > > Github & Blog: http://huangjs.github.com/
> > > >
> > >
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?

2014-09-17 Thread Jianshi Huang
Yes, Esteban, there are very practical reasons to do the pre-split
dynamically.

Jianshi

On Thu, Sep 18, 2014 at 1:41 AM, Esteban Gutierrez 
wrote:

> Hi Jianshi,
>
> Is there any reason why you need to split dynamically the table? Users
> usually pre-split their tables with a specific number of splits or they
> pick a region split policy that fits their needs:
>
>
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/DelimitedKeyPrefixRegionSplitPolicy.html
>
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/ConstantSizeRegionSplitPolicy.html
>
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/IncreasingToUpperBoundRegionSplitPolicy.html
>
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/KeyPrefixRegionSplitPolicy.html
>
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/DisabledRegionSplitPolicy.html
>
> or they have the options to implement their own. See for some details
> http://hbase.apache.org/book/regions.arch.html#arch.region.split
>
> cheers,
> esteban.
>
>
> --
> Cloudera, Inc.
>
>
> On Wed, Sep 17, 2014 at 5:06 AM, Shahab Yunus 
> wrote:
>
> > Split is an async operation. When you call it, and the call returns, it
> > does not mean that the region has been created yet.
> >
> > So either you wait for a while (using Thread.sleep) or check for the
> number
> > of regions in a loop and until they have increased to the value you want
> > and then access the region. The former is not a good idea, though you can
> > try it out just to make sure that this is indeed the issue.
> >
> > What am I suggesting is something like (pseudo code):
> >
> > while(new#regions <= old#regions)
> > {
> >new#regions = admin.getLatest#regions
> > }
> >
> > Regards,
> > Shahab
> >
> > On Wed, Sep 17, 2014 at 5:39 AM, Jianshi Huang 
> > wrote:
> >
> > > I constantly get the following errors when I tried to add splits to a
> > > table.
> > >
> > >
> > >
> >
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException):
> > > org.apache.hadoop.hbase.NotServingRegionException: Region
> > >
> >
> grapple_vertices,cust|rval#7eb7cffca280|1636500018299676757,1410945568
> > > 484.e7743495366df3c82a8571b36c2bdac3. is not online on
> > > lvshdc5dn0193.lvs.paypal.com,60020,1405014719359
> > > at
> > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
> > > at
> > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
> > > at
> > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.splitRegion(HRegionServer.java:3818)
> > > at
> > >
> > >
> > > But when I checked the region server (from hbase' webUI), the region is
> > > actually listed there.
> > >
> > > What does the error mean actually? How can I solve it?
> > >
> > > Currently I'm adding splits single-threaded, and I want to make it
> > > parallel, is there anything I need to be careful about?
> > >
> > > Here's the code for adding splits:
> > >
> > >   def addSplits(tableName: String, splitKeys: Seq[Array[Byte]]): Unit
> = {
> > > val admin = new HBaseAdmin(conn)
> > >
> > > try {
> > >   val regions = admin.getTableRegions(tableName.getBytes("UTF8"))
> > >   val regionStartKeys = regions.map(_.getStartKey)
> > >   val splits = splitKeys.diff(regionStartKeys)
> > >
> > >   splits.foreach { splitPoint =>
> > > admin.split(tableName.getBytes("UTF8"), splitPoint)
> > >   }
> > >   // NOTE: important!
> > >   admin.balancer()
> > > }
> > > finally {
> > >   admin.close()
> > > }
> > >   }
> > >
> > >
> > > Any help is appreciated.
> > >
> > > --
> > > Jianshi Huang
> > >
> > > LinkedIn: jianshi
> > > Twitter: @jshuang
> > > Github & Blog: http://huangjs.github.com/
> > >
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?

2014-09-17 Thread Jianshi Huang
I see, Shahab, async makes sense, but I'd prefer that the HBase client do the
retry for me and let me specify a timeout parameter.

One question: does that mean adding multiple splits to one region has to be
done sequentially? How can I add region splits in parallel? Does createTable
do something special?
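
For reference, a runnable version of the polling approach Shahab describes
below (a minimal sketch; assumes an open HBaseAdmin):

  import org.apache.hadoop.hbase.client.HBaseAdmin

  def waitForMoreRegions(admin: HBaseAdmin, table: Array[Byte], oldCount: Int): Unit = {
    // Poll until the table's region count has grown past the old count.
    while (admin.getTableRegions(table).size() <= oldCount) {
      Thread.sleep(200)  // brief pause between polls
    }
  }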


Jianshi


On Wed, Sep 17, 2014 at 8:06 PM, Shahab Yunus 
wrote:

> Split is an async operation. When you call it, and the call returns, it
> does not mean that the region has been created yet.
>
> So either you wait for a while (using Thread.sleep) or check for the number
> of regions in a loop and until they have increased to the value you want
> and then access the region. The former is not a good idea, though you can
> try it out just to make sure that this is indeed the issue.
>
> What am I suggesting is something like (pseudo code):
>
> while(new#regions <= old#regions)
> {
>new#regions = admin.getLatest#regions
> }
>
> Regards,
> Shahab
>
> On Wed, Sep 17, 2014 at 5:39 AM, Jianshi Huang 
> wrote:
>
> > I constantly get the following errors when I tried to add splits to a
> > table.
> >
> >
> >
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException):
> > org.apache.hadoop.hbase.NotServingRegionException: Region
> >
> grapple_vertices,cust|rval#7eb7cffca280|1636500018299676757,1410945568
> > 484.e7743495366df3c82a8571b36c2bdac3. is not online on
> > lvshdc5dn0193.lvs.paypal.com,60020,1405014719359
> > at
> >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
> > at
> >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
> > at
> >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.splitRegion(HRegionServer.java:3818)
> > at
> >
> >
> > But when I checked the region server (from hbase' webUI), the region is
> > actually listed there.
> >
> > What does the error mean actually? How can I solve it?
> >
> > Currently I'm adding splits single-threaded, and I want to make it
> > parallel, is there anything I need to be careful about?
> >
> > Here's the code for adding splits:
> >
> >   def addSplits(tableName: String, splitKeys: Seq[Array[Byte]]): Unit = {
> > val admin = new HBaseAdmin(conn)
> >
> > try {
> >   val regions = admin.getTableRegions(tableName.getBytes("UTF8"))
> >   val regionStartKeys = regions.map(_.getStartKey)
> >   val splits = splitKeys.diff(regionStartKeys)
> >
> >   splits.foreach { splitPoint =>
> > admin.split(tableName.getBytes("UTF8"), splitPoint)
> >   }
> >   // NOTE: important!
> >   admin.balancer()
> > }
> > finally {
> >   admin.close()
> > }
> >   }
> >
> >
> > Any help is appreciated.
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Error during HBaseAdmin.split: Exception: org.apache.hadoop.hbase.NotServingRegionException, What does that mean?

2014-09-17 Thread Jianshi Huang
I constantly get the following errors when I try to add splits to a table.

org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException):
org.apache.hadoop.hbase.NotServingRegionException: Region
grapple_vertices,cust|rval#7eb7cffca280|1636500018299676757,1410945568
484.e7743495366df3c82a8571b36c2bdac3. is not online on
lvshdc5dn0193.lvs.paypal.com,60020,1405014719359
at
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.splitRegion(HRegionServer.java:3818)
at


But when I checked the region server (from HBase's web UI), the region is
actually listed there.

What does the error actually mean? How can I solve it?

Currently I'm adding splits single-threaded, and I want to make it
parallel; is there anything I need to be careful about?

Here's the code for adding splits:

  def addSplits(tableName: String, splitKeys: Seq[Array[Byte]]): Unit = {
    val admin = new HBaseAdmin(conn)

    try {
      // Skip split points that already start a region. Array[Byte] compares
      // by reference, so compare contents with Bytes.equals
      // (org.apache.hadoop.hbase.util.Bytes) instead of Seq.diff.
      // (Assumes scala.collection.JavaConversions._ is in scope.)
      val regions = admin.getTableRegions(tableName.getBytes("UTF8"))
      val regionStartKeys = regions.map(_.getStartKey)
      val splits = splitKeys.filterNot(k => regionStartKeys.exists(Bytes.equals(k, _)))

      splits.foreach { splitPoint =>
        // split() is asynchronous: it returns before the daughter regions
        // are online.
        admin.split(tableName.getBytes("UTF8"), splitPoint)
      }
      // NOTE: important! Spread the new regions across region servers.
      admin.balancer()
    }
    finally {
      admin.close()
    }
  }


Any help is appreciated.

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Deploy filter on per table basis

2014-09-09 Thread Jianshi Huang
Thanks Ted!

Jianshi

On Tue, Sep 9, 2014 at 10:39 PM, Ted Yu  wrote:

> Please take a look at HBASE-1936
>
> Cheers
>
> On Mon, Sep 8, 2014 at 11:26 PM, Jianshi Huang 
> wrote:
>
> > Hi,
> >
> > According to the HBAse definitive guide, I need to change to change
> > hbase-env.sh and put my jars in hbase's classpath, then I also need to
> > restart hbase daemon to make my customized filters effective.
> >
> > In the Coprocessor loading section, it also mentioned that coprocessor
> can
> > be setup and loaded on per table basis.
> >
> > So is it also possible for filter? The main problem is that I don't have
> > HBase admin permissions to do the change.
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Deploy filter on per table basis

2014-09-08 Thread Jianshi Huang
Hi,

According to the HBase Definitive Guide, I need to change hbase-env.sh and
put my jars in HBase's classpath, and then I also need to restart the HBase
daemons to make my customized filters effective.

In the coprocessor loading section, it also mentions that coprocessors can
be set up and loaded on a per-table basis.

So is this also possible for filters? The main problem is that I don't have
HBase admin permissions to make the change.
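
For the coprocessor case, per-table loading looks roughly like this (a
minimal sketch; the class name, jar path, and table name are hypothetical,
and it assumes you are at least allowed to alter the table):

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.hbase.Coprocessor

  // Attach a coprocessor to a single table from a jar on HDFS,
  // without touching hbase-env.sh or restarting daemons.
  val desc = admin.getTableDescriptor("mytable".getBytes("UTF8"))
  desc.addCoprocessor("com.example.MyRegionObserver",
    new Path("hdfs:///user/jianshi/cp/my-cp.jar"),
    Coprocessor.PRIORITY_USER, null)
  admin.disableTable("mytable")
  admin.modifyTable("mytable".getBytes("UTF8"), desc)
  admin.enableTable("mytable")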


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-07 Thread Jianshi Huang
Locality is important; that's why I chose CFs to put related data into one
group. I can surely put the CF part at the head of the rowkey to achieve a
similar result, but since the number of types is fixed, I don't see any
benefit in doing that.

With setLoadColumnFamiliesOnDemand, which I learned about from Ted, it looks
like the performance should be similar.
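
For reference, a minimal sketch of enabling that feature on a scan (the
family, qualifier, and value below are hypothetical placeholders):

  import org.apache.hadoop.hbase.client.Scan
  import org.apache.hadoop.hbase.filter.{CompareFilter, SingleColumnValueFilter}
  import org.apache.hadoop.hbase.util.Bytes

  val scan = new Scan()
  // Load the other column families only for rows the filter accepts.
  scan.setLoadColumnFamiliesOnDemand(true)
  val filter = new SingleColumnValueFilter(
    Bytes.toBytes("meta"), Bytes.toBytes("status"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("active"))
  filter.setFilterIfMissing(true)  // skip rows missing the column
  scan.setFilter(filter)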

Am I missing something? Please enlighten me.

Jianshi

On Mon, Sep 8, 2014 at 3:41 AM, Michael Segel 
wrote:

> I would suggest rethinking column families and look at your potential for
> a slightly different row key.
>
> Going with column families doesn’t really make sense.
>
> Also how wide are the rows? (worst case?)
>
> one idea is to make type part of the RK…
>
> HTH
>
> -Mike
>
> On Sep 7, 2014, at 2:40 AM, Jianshi Huang  wrote:
>
> > Hi Michael,
> >
> > Thanks for the questions.
> >
> > I'm modeling dynamic Graphs in HBase, all elements (vertices, edges)
> have a
> > timestamp and I can query things like events between A and B for the
> last 7
> > days.
> >
> > CFs are used for grouping different types of data for the same account.
> > However, I have lots of skews in the data, to avoid having too much for
> the
> > same row, I had to put what was in CQs to now RKs. So CF now acts more
> like
> > a table.
> >
> > There's one CF containing sequence of events ordered by timestamp, and
> this
> > CF is quite different as the use case is mostly in mapreduce jobs.
> >
> > Jianshi
> >
> >
> >
> >
> > On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel  >
> > wrote:
> >
> >> Again, a silly question.
> >>
> >> Why are you using column families?
> >>
> >> Just to play devil’s advocate in terms of design, why are you not
> treating
> >> your row as a record? Think hierarchal not relational.
> >>
> >> This really gets in to some design theory.
> >>
> >> Think Column Family as a way to group data that has the same row key,
> >> reference the same thing, yet the data in each column family is used
> >> separately.
> >> The example I always turn to when teaching, is to think of an order
> entry
> >> system at a retailer.
> >>
> >> You generate data which is segmented by business process. (order entry,
> >> pick slips, shipping, invoicing) All reflect a single order, yet the
> data
> >> in each process tends to be accessed separately.
> >> (You don’t need the order entry when using the pick slip to pull orders
> >> from the warehouse.)  So here, the data access pattern is that each
> column
> >> family is used separately, except in generating the data (the order
> entry
> >> is used to generate the pick slip(s) and set up things like backorders
> and
> >> then the pick process generates the shipping slip(s) etc …  And since
> they
> >> are all focused on the same order, they have the same row key.
> >>
> >> So its reasonable to ask how you are accessing the data and how you are
> >> designing your HBase model?
> >>
> >> Many times,  developers create a model using column families because the
> >> developer is thinking in terms of relationships. Not access patterns on
> the
> >> data.
> >>
> >> Does this make sense?
> >>
> >>
> >> On Sep 6, 2014, at 7:46 PM, Jianshi Huang 
> wrote:
> >>
> >>> BTW, a little explanation about the binning I mentioned.
> >>>
> >>> Currently the rowkey looks like ##.
> >>>
> >>> And with binning, it looks like
> >>> ###. The bin_number
> could
> >> be
> >>> id % 256 or timestamp % 256. And the table could be pre-splitted. So
> >> future
> >>> ingestions could do parallel insertion to # regions, even without
> >>> pre-split.
> >>>
> >>>
> >>> Jianshi
> >>>
> >>>
> >>> On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang  >
> >>> wrote:
> >>>
> >>>> Each range might span multiple regions, depending on the data size I
> >> want
> >>>> scan for MR jobs.
> >>>>
> >>>> The ranges are dynamic, specified by the user, but the number of bins
> >> can
> >>>> be static (when the table/schema is created).
> >>>>
> >>>> Jianshi
> >>>>
> >>>>
> >>>> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu  wrote:
> >>&g

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Hi Michael,

Thanks for the questions.

I'm modeling dynamic graphs in HBase. All elements (vertices, edges) have a
timestamp, and I can query things like events between A and B for the last 7
days.

CFs are used for grouping different types of data for the same account.
However, I have a lot of skew in the data; to avoid having too much in the
same row, I had to move what was in CQs into RKs. So a CF now acts more like
a table.

There's one CF containing a sequence of events ordered by timestamp, and this
CF is quite different, as the use case is mostly mapreduce jobs.

Jianshi




On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel 
wrote:

> Again, a silly question.
>
> Why are you using column families?
>
> Just to play devil’s advocate in terms of design, why are you not treating
> your row as a record? Think hierarchal not relational.
>
> This really gets in to some design theory.
>
> Think Column Family as a way to group data that has the same row key,
> reference the same thing, yet the data in each column family is used
> separately.
> The example I always turn to when teaching, is to think of an order entry
> system at a retailer.
>
> You generate data which is segmented by business process. (order entry,
> pick slips, shipping, invoicing) All reflect a single order, yet the data
> in each process tends to be accessed separately.
> (You don’t need the order entry when using the pick slip to pull orders
> from the warehouse.)  So here, the data access pattern is that each column
> family is used separately, except in generating the data (the order entry
> is used to generate the pick slip(s) and set up things like backorders and
> then the pick process generates the shipping slip(s) etc …  And since they
> are all focused on the same order, they have the same row key.
>
> So its reasonable to ask how you are accessing the data and how you are
> designing your HBase model?
>
> Many times,  developers create a model using column families because the
> developer is thinking in terms of relationships. Not access patterns on the
> data.
>
> Does this make sense?
>
>
> On Sep 6, 2014, at 7:46 PM, Jianshi Huang  wrote:
>
> > BTW, a little explanation about the binning I mentioned.
> >
> > Currently the rowkey looks like ##.
> >
> > And with binning, it looks like
> > ###. The bin_number could
> be
> > id % 256 or timestamp % 256. And the table could be pre-splitted. So
> future
> > ingestions could do parallel insertion to # regions, even without
> > pre-split.
> >
> >
> > Jianshi
> >
> >
> > On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang 
> > wrote:
> >
> >> Each range might span multiple regions, depending on the data size I
> want
> >> scan for MR jobs.
> >>
> >> The ranges are dynamic, specified by the user, but the number of bins
> can
> >> be static (when the table/schema is created).
> >>
> >> Jianshi
> >>
> >>
> >> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu  wrote:
> >>
> >>> bq. 16 to 256 ranges
> >>>
> >>> Would each range be within single region or the range may span regions
> ?
> >>> Are the ranges dynamic ?
> >>>
> >>> Using command line for multiple ranges would be out of question. A file
> >>> with ranges is needed.
> >>>
> >>> Cheers
> >>>
> >>>
> >>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang <
> jianshi.hu...@gmail.com>
> >>> wrote:
> >>>
> >>>> Thanks Ted for the reference.
> >>>>
> >>>> That's right, extend the row.start and row.end to specify multiple
> >>> ranges
> >>>> and also getSplits.
> >>>>
> >>>> I would probably bin the event sequence CF into 16 to 256 bins. So 16
> to
> >>>> 256 ranges.
> >>>>
> >>>> Jianshi
> >>>>
> >>>>
> >>>>
> >>>> On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu  wrote:
> >>>>
> >>>>> Please refer to HBASE-5416 Filter on one CF and if a match, then load
> >>> and
> >>>>> return full row
> >>>>>
> >>>>> bq. to extend TableInputFormat to accept multiple row ranges
> >>>>>
> >>>>> You mean extending hbase.mapreduce.scan.row.start and
> >>>>> hbase.mapreduce.scan.row.stop so that multiple ranges can be
> >>> specified ?
> >>>>> How many such ranges do you normally need ?
>

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
BTW, a little explanation about the binning I mentioned.

Currently the rowkey looks like ##.

And with binning, it looks like
###. The bin_number could be
id % 256 or timestamp % 256, and the table could be pre-split. So future
ingestions could do parallel insertion into # regions, even without
further pre-splits.
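
As a minimal sketch of that layout (the real key fields are elided above, so
the id and the remainder of the key are hypothetical here):

  import org.apache.hadoop.hbase.util.Bytes

  val NumBins = 256

  // Prefix the original rowkey with a one-byte bin derived from the id.
  def binnedRowkey(id: Long, restOfKey: Array[Byte]): Array[Byte] =
    Bytes.add(Array((id % NumBins).toByte), restOfKey)

  // Pre-split points: one region per bin.
  val splitKeys: Array[Array[Byte]] = (1 until NumBins).map(b => Array(b.toByte)).toArray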


Jianshi


On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang 
wrote:

> Each range might span multiple regions, depending on the data size I want
> scan for MR jobs.
>
> The ranges are dynamic, specified by the user, but the number of bins can
> be static (when the table/schema is created).
>
> Jianshi
>
>
> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu  wrote:
>
>> bq. 16 to 256 ranges
>>
>> Would each range be within single region or the range may span regions ?
>> Are the ranges dynamic ?
>>
>> Using command line for multiple ranges would be out of question. A file
>> with ranges is needed.
>>
>> Cheers
>>
>>
>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang 
>> wrote:
>>
>> > Thanks Ted for the reference.
>> >
>> > That's right, extend the row.start and row.end to specify multiple
>> ranges
>> > and also getSplits.
>> >
>> > I would probably bin the event sequence CF into 16 to 256 bins. So 16 to
>> > 256 ranges.
>> >
>> > Jianshi
>> >
>> >
>> >
>> > On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu  wrote:
>> >
>> > > Please refer to HBASE-5416 Filter on one CF and if a match, then load
>> and
>> > > return full row
>> > >
>> > > bq. to extend TableInputFormat to accept multiple row ranges
>> > >
>> > > You mean extending hbase.mapreduce.scan.row.start and
>> > > hbase.mapreduce.scan.row.stop so that multiple ranges can be
>> specified ?
>> > > How many such ranges do you normally need ?
>> > >
>> > > Cheers
>> > >
>> > >
>> > > On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <
>> jianshi.hu...@gmail.com>
>> > > wrote:
>> > >
>> > > > Thanks Ted,
>> > > >
>> > > > I'll pre-split the table during ingestion. The reason to keep the
>> > rowkey
>> > > > monotonic is for easier working with TableInputFormat, otherwise I
>> > > would've
>> > > > binned it into 256 splits. (well, I think a good way is to extend
>> > > > TableInputFormat to accept multiple row ranges, if there's an
>> existing
>> > > > efficient implementation, please let me know :)
>> > > >
>> > > > Would you elaborate a little more on the heap memory usage during
>> scan?
>> > > Is
>> > > > there any reference to that?
>> > > >
>> > > > Jianshi
>> > > >
>> > > >
>> > > >
>> > > > On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu  wrote:
>> > > >
>> > > > > If you use monotonically increasing rowkeys, separating out the
>> > column
>> > > > > family into a new table would give you same issue you're facing
>> > today.
>> > > > >
>> > > > > Using a single table, essential column family feature would reduce
>> > the
>> > > > > amount of heap memory used during scan. With two tables, there is
>> no
>> > > such
>> > > > > facility.
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > >
>> > > > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <
>> > > jianshi.hu...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Ted,
>> > > > > >
>> > > > > > Yes, that's the table having RegionTooBusyExceptions :) But the
>> > > > > performance
>> > > > > > I care most are scan performance.
>> > > > > >
>> > > > > > It's mostly for analytics, so I don't care much about atomicity
>> > > > > currently.
>> > > > > >
>> > > > > > What's your suggestion?
>> > > > > >
>> > > > > > Jianshi
>> > > > > >
>> > > > > >
>> > > > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu 
>> > wrote:
>> > > > > &

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Each range might span multiple regions, depending on the data size I want to
scan for MR jobs.

The ranges are dynamic, specified by the user, but the number of bins can
be static (when the table/schema is created).

Jianshi


On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu  wrote:

> bq. 16 to 256 ranges
>
> Would each range be within single region or the range may span regions ?
> Are the ranges dynamic ?
>
> Using command line for multiple ranges would be out of question. A file
> with ranges is needed.
>
> Cheers
>
>
> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang 
> wrote:
>
> > Thanks Ted for the reference.
> >
> > That's right, extend the row.start and row.end to specify multiple ranges
> > and also getSplits.
> >
> > I would probably bin the event sequence CF into 16 to 256 bins. So 16 to
> > 256 ranges.
> >
> > Jianshi
> >
> >
> >
> > On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu  wrote:
> >
> > > Please refer to HBASE-5416 Filter on one CF and if a match, then load
> and
> > > return full row
> > >
> > > bq. to extend TableInputFormat to accept multiple row ranges
> > >
> > > You mean extending hbase.mapreduce.scan.row.start and
> > > hbase.mapreduce.scan.row.stop so that multiple ranges can be specified
> ?
> > > How many such ranges do you normally need ?
> > >
> > > Cheers
> > >
> > >
> > > On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <
> jianshi.hu...@gmail.com>
> > > wrote:
> > >
> > > > Thanks Ted,
> > > >
> > > > I'll pre-split the table during ingestion. The reason to keep the
> > rowkey
> > > > monotonic is for easier working with TableInputFormat, otherwise I
> > > would've
> > > > binned it into 256 splits. (well, I think a good way is to extend
> > > > TableInputFormat to accept multiple row ranges, if there's an
> existing
> > > > efficient implementation, please let me know :)
> > > >
> > > > Would you elaborate a little more on the heap memory usage during
> scan?
> > > Is
> > > > there any reference to that?
> > > >
> > > > Jianshi
> > > >
> > > >
> > > >
> > > > On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu  wrote:
> > > >
> > > > > If you use monotonically increasing rowkeys, separating out the
> > column
> > > > > family into a new table would give you same issue you're facing
> > today.
> > > > >
> > > > > Using a single table, essential column family feature would reduce
> > the
> > > > > amount of heap memory used during scan. With two tables, there is
> no
> > > such
> > > > > facility.
> > > > >
> > > > > Cheers
> > > > >
> > > > >
> > > > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <
> > > jianshi.hu...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Ted,
> > > > > >
> > > > > > Yes, that's the table having RegionTooBusyExceptions :) But the
> > > > > performance
> > > > > > I care most are scan performance.
> > > > > >
> > > > > > It's mostly for analytics, so I don't care much about atomicity
> > > > > currently.
> > > > > >
> > > > > > What's your suggestion?
> > > > > >
> > > > > > Jianshi
> > > > > >
> > > > > >
> > > > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu 
> > wrote:
> > > > > >
> > > > > > > Is this the same table you mentioned in the thread about
> > > > > > > RegionTooBusyException
> > > > > > > ?
> > > > > > >
> > > > > > > If you move the column family to another table, you may have to
> > > > handle
> > > > > > > atomicity yourself - currently atomic operations are within
> > region
> > > > > > > boundaries.
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <
> > > > jianshi.hu...@gmail.com
> > > > > >
> > > > > >

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Thanks Ted for the reference.

That's right: extend row.start and row.stop to specify multiple ranges, and
also getSplits.

I would probably bin the event sequence CF into 16 to 256 bins. So 16 to
256 ranges.

Jianshi



On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu  wrote:

> Please refer to HBASE-5416 Filter on one CF and if a match, then load and
> return full row
>
> bq. to extend TableInputFormat to accept multiple row ranges
>
> You mean extending hbase.mapreduce.scan.row.start and
> hbase.mapreduce.scan.row.stop so that multiple ranges can be specified ?
> How many such ranges do you normally need ?
>
> Cheers
>
>
> On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang 
> wrote:
>
> > Thanks Ted,
> >
> > I'll pre-split the table during ingestion. The reason to keep the rowkey
> > monotonic is for easier working with TableInputFormat, otherwise I
> would've
> > binned it into 256 splits. (well, I think a good way is to extend
> > TableInputFormat to accept multiple row ranges, if there's an existing
> > efficient implementation, please let me know :)
> >
> > Would you elaborate a little more on the heap memory usage during scan?
> Is
> > there any reference to that?
> >
> > Jianshi
> >
> >
> >
> > On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu  wrote:
> >
> > > If you use monotonically increasing rowkeys, separating out the column
> > > family into a new table would give you same issue you're facing today.
> > >
> > > Using a single table, essential column family feature would reduce the
> > > amount of heap memory used during scan. With two tables, there is no
> such
> > > facility.
> > >
> > > Cheers
> > >
> > >
> > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <
> jianshi.hu...@gmail.com>
> > > wrote:
> > >
> > > > Hi Ted,
> > > >
> > > > Yes, that's the table having RegionTooBusyExceptions :) But the
> > > performance
> > > > I care most are scan performance.
> > > >
> > > > It's mostly for analytics, so I don't care much about atomicity
> > > currently.
> > > >
> > > > What's your suggestion?
> > > >
> > > > Jianshi
> > > >
> > > >
> > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu  wrote:
> > > >
> > > > > Is this the same table you mentioned in the thread about
> > > > > RegionTooBusyException
> > > > > ?
> > > > >
> > > > > If you move the column family to another table, you may have to
> > handle
> > > > > atomicity yourself - currently atomic operations are within region
> > > > > boundaries.
> > > > >
> > > > > Cheers
> > > > >
> > > > >
> > > > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <
> > jianshi.hu...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm currently putting everything into one table (to make cross
> > > > reference
> > > > > > queries easier) and there's one CF which contains rowkeys very
> > > > different
> > > > > to
> > > > > > the rest. Currently it works well, but I'm wondering if it will
> > cause
> > > > > > performance issues in the future.
> > > > > >
> > > > > > So my questions are
> > > > > >
> > > > > > 1) will there be performance penalties in the way I'm doing?
> > > > > > 2) should I move that CF to a separate table?
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > --
> > > > > > Jianshi Huang
> > > > > >
> > > > > > LinkedIn: jianshi
> > > > > > Twitter: @jshuang
> > > > > > Github & Blog: http://huangjs.github.com/
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Jianshi Huang
> > > >
> > > > LinkedIn: jianshi
> > > > Twitter: @jshuang
> > > > Github & Blog: http://huangjs.github.com/
> > > >
> > >
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Thanks Ted,

I'll pre-split the table during ingestion. The reason to keep the rowkey
monotonic is to make working with TableInputFormat easier; otherwise I
would've binned it into 256 splits. (Well, I think a good way is to extend
TableInputFormat to accept multiple row ranges; if there's an existing
efficient implementation, please let me know :)
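
One existing option that may fit, if your HBase version ships it (it appeared
around 0.94.5/0.96), is MultiTableInputFormat, which takes a list of Scan
objects that can all target the same table with different row ranges; a
sketch, with a hypothetical table name and ranges:

  import scala.collection.JavaConverters._
  import org.apache.hadoop.hbase.client.Scan
  import org.apache.hadoop.hbase.util.Bytes

  // One Scan per row range, all pointing at the same table.
  val scans = Seq(("a", "f"), ("m", "r")).map { case (start, stop) =>
    val s = new Scan(Bytes.toBytes(start), Bytes.toBytes(stop))
    s.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes("mytable"))
    s
  }
  // Then: TableMapReduceUtil.initTableMapperJob(scans.asJava,
  //   classOf[MyMapper], keyClass, valueClass, job)  // MyMapper is a placeholder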

Would you elaborate a little more on the heap memory usage during scans? Is
there any reference for that?

Jianshi



On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu  wrote:

> If you use monotonically increasing rowkeys, separating out the column
> family into a new table would give you same issue you're facing today.
>
> Using a single table, essential column family feature would reduce the
> amount of heap memory used during scan. With two tables, there is no such
> facility.
>
> Cheers
>
>
> On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang 
> wrote:
>
> > Hi Ted,
> >
> > Yes, that's the table having RegionTooBusyExceptions :) But the
> performance
> > I care most are scan performance.
> >
> > It's mostly for analytics, so I don't care much about atomicity
> currently.
> >
> > What's your suggestion?
> >
> > Jianshi
> >
> >
> > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu  wrote:
> >
> > > Is this the same table you mentioned in the thread about
> > > RegionTooBusyException
> > > ?
> > >
> > > If you move the column family to another table, you may have to handle
> > > atomicity yourself - currently atomic operations are within region
> > > boundaries.
> > >
> > > Cheers
> > >
> > >
> > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang  >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm currently putting everything into one table (to make cross
> > reference
> > > > queries easier) and there's one CF which contains rowkeys very
> > different
> > > to
> > > > the rest. Currently it works well, but I'm wondering if it will cause
> > > > performance issues in the future.
> > > >
> > > > So my questions are
> > > >
> > > > 1) will there be performance penalties in the way I'm doing?
> > > > 2) should I move that CF to a separate table?
> > > >
> > > >
> > > > Thanks,
> > > > --
> > > > Jianshi Huang
> > > >
> > > > LinkedIn: jianshi
> > > > Twitter: @jshuang
> > > > Github & Blog: http://huangjs.github.com/
> > > >
> > >
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Well, write performance is also important... I'll probably ingest 1k~10k
records/second.

Jianshi


On Sun, Sep 7, 2014 at 1:11 AM, Jianshi Huang 
wrote:

> Hi Ted,
>
> Yes, that's the table having RegionTooBusyExceptions :) But the
> performance I care most are scan performance.
>
> It's mostly for analytics, so I don't care much about atomicity currently.
>
> What's your suggestion?
>
> Jianshi
>
>
> On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu  wrote:
>
>> Is this the same table you mentioned in the thread about
>> RegionTooBusyException
>> ?
>>
>> If you move the column family to another table, you may have to handle
>> atomicity yourself - currently atomic operations are within region
>> boundaries.
>>
>> Cheers
>>
>>
>> On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang 
>> wrote:
>>
>> > Hi,
>> >
>> > I'm currently putting everything into one table (to make cross reference
>> > queries easier) and there's one CF which contains rowkeys very
>> different to
>> > the rest. Currently it works well, but I'm wondering if it will cause
>> > performance issues in the future.
>> >
>> > So my questions are
>> >
>> > 1) will there be performance penalties in the way I'm doing?
>> > 2) should I move that CF to a separate table?
>> >
>> >
>> > Thanks,
>> > --
>> > Jianshi Huang
>> >
>> > LinkedIn: jianshi
>> > Twitter: @jshuang
>> > Github & Blog: http://huangjs.github.com/
>> >
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Hi Ted,

Yes, that's the table having RegionTooBusyExceptions :) But the performance I
care about most is scan performance.

It's mostly for analytics, so I don't care much about atomicity currently.

What's your suggestion?

Jianshi


On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu  wrote:

> Is this the same table you mentioned in the thread about
> RegionTooBusyException
> ?
>
> If you move the column family to another table, you may have to handle
> atomicity yourself - currently atomic operations are within region
> boundaries.
>
> Cheers
>
>
> On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang 
> wrote:
>
> > Hi,
> >
> > I'm currently putting everything into one table (to make cross reference
> > queries easier) and there's one CF which contains rowkeys very different
> to
> > the rest. Currently it works well, but I'm wondering if it will cause
> > performance issues in the future.
> >
> > So my questions are
> >
> > 1) will there be performance penalties in the way I'm doing?
> > 2) should I move that CF to a separate table?
> >
> >
> > Thanks,
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Hi,

I'm currently putting everything into one table (to make cross-reference
queries easier), and there's one CF whose rowkeys are very different from
the rest. Currently it works well, but I'm wondering if it will cause
performance issues in the future.

So my questions are:

1) Will there be performance penalties in the way I'm doing this?
2) Should I move that CF to a separate table?


Thanks,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-05 Thread Jianshi Huang
Thanks Ted!

Didn't know I still needed to run the 'balancer' command.

Is there a way to do it programmatically?

Jianshi



On Sat, Sep 6, 2014 at 12:29 AM, Ted Yu  wrote:

> After splitting the region, you may need to run balancer to spread the new
> regions out.
>
> Cheers
>
>
> On Fri, Sep 5, 2014 at 9:25 AM, Jianshi Huang 
> wrote:
>
> > Hi Shahab,
> >
> > I see, that seems to be the right way...
> >
> >
> > On Sat, Sep 6, 2014 at 12:21 AM, Shahab Yunus 
> > wrote:
> >
> > > Shahab
> >
> >
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
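
A minimal sketch of triggering a balancer run from client code, assuming the
0.98-era HBaseAdmin API discussed above (error handling omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class RunBalancer {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Asks the master to run one balancer pass; returns false if the
      // balancer is disabled or regions are in transition.
      boolean ran = admin.balancer();
      System.out.println("Balancer run accepted: " + ran);
    } finally {
      admin.close();
    }
  }
}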


Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-05 Thread Jianshi Huang
Hi Steven,

I did 1) and 2) and the error was during LoadIncrementalHFiles.

I can't do 3) because that CF is mostly used for mapreduce inputs, so a
continuous rowkey is preferred.

Jianshi



On Sat, Sep 6, 2014 at 12:29 AM, Magana-zook, Steven Alan <
maganazo...@llnl.gov> wrote:

> Jianshi,
>
> I have seen many solutions to importing this kind of data:
>
> 1. Pre-splitting regions (I did not try this)
>
> 2. Using a map reduce job to create HFiles instead of putting individual
> rows into the database
> (instructions here: http://hbase.apache.org/book/arch.bulk.load.html)
>
> 3. Modifying the row key to not be monotonic
>
> I went with the third solution by prepending a random integer before the
> other fields in my composite row key ("<random int>_<field 1>_<field 2>...")
>
> When you make any changes, you can verify it is working by viewing the
> Hbase web interface (port 60010 on the hbase master) to see the requests
> per second on the various region servers.
>
>
> Thank you,
> Steven Magana-Zook
>
>
>
>
>
>
> On 9/5/14 9:14 AM, "Jianshi Huang"  wrote:
>
> >Thanks Ted, I'll try to do a major compact.
> >
> >Hi Steven,
> >
> >Yes, most of my rows are hashed to make it randomly distributed, but one
> >column family has monotonically increasing rowkeys, and it's used for
> >recording sequence of events.
> >
> >Do you have a solution for how to bulk import this kind of data?
> >
> >Jianshi
> >
> >
> >
> >On Sat, Sep 6, 2014 at 12:00 AM, Magana-zook, Steven Alan <
> >maganazo...@llnl.gov> wrote:
> >
> >> Hi Jianshi,
> >>
> >> What are the field(s) in your row key? If your row key is monotonically
> >> increasing then you will be sending all of your requests to one region
> >> server. Even after the region splits, all new entries will keep
> >>punishing
> >> one server (the region responsible for the split containing the new
> >>keys).
> >>
> >> See these articles that may help if this is indeed your issue:
> >> 1. http://hbase.apache.org/book/rowkey.design.html
> >> 2.
> >>
> >>
> http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-inc
> >>re
> >> asing-values-are-bad/
> >>
> >> Regards,
> >> Steven Magana-Zook
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 9/5/14 8:54 AM, "Jianshi Huang"  wrote:
> >>
> >> >Hi JM,
> >> >
> >> >What do you mean by the 'destination cluster'? The files are in the
> >>same
> >> >Hadoop/HDFS cluster where HBase is running.
> >> >
> >> >Do you mean do the bulk importing on HBase Master node?
> >> >
> >> >
> >> >Jianshi
> >> >
> >> >
> >> >On Fri, Sep 5, 2014 at 11:18 PM, Jean-Marc Spaggiari <
> >> >jean-m...@spaggiari.org> wrote:
> >> >
> >> >> Hi Jianshi,
> >> >>
> >> >> You might want to upload the file on the destination cluster first and
> >> >> then re-run your bulk load from there. That way the transfer time will
> >> >> not be taken into consideration for the timeout since the files will be
> >> >> local.
> >> >>
> >> >> JM
> >> >>
> >> >>
> >> >> 2014-09-05 11:15 GMT-04:00 Jianshi Huang :
> >> >>
> >> >> > I'm importing 2TB of generated HFiles to HBase and I constantly get
> >> >>the
> >> >> > following errors:
> >> >> >
> >> >> > Caused by:
> >> >> > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.RegionTooBusyException):
> >> >> > org.apache.hadoop.hbase.RegionTooBusyException: failed to get a lock
> >> >> > in 60000 ms.
> >> >> > regionName=grapple_edges_v2,ff00,1409817320781.6d2955c780b39523de733f3565642d96.,
> >> >> > server=x.xxx.xxx,60020,1404854700728
> >> >> > at org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5851)

Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-05 Thread Jianshi Huang
Hi Shahab,

I see, that seems to be the right way...


On Sat, Sep 6, 2014 at 12:21 AM, Shahab Yunus 
wrote:

> Shahab





-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-05 Thread Jianshi Huang
Thanks Ted, I'll try to do a major compact.

Hi Steven,

Yes, most of my rows are hashed to make it randomly distributed, but one
column family has monotonically increasing rowkeys, and it's used for
recording sequence of events.

Do you have a solution for how to bulk import this kind of data?

Jianshi



On Sat, Sep 6, 2014 at 12:00 AM, Magana-zook, Steven Alan <
maganazo...@llnl.gov> wrote:

> Hi Jianshi,
>
> What are the field(s) in your row key? If your row key is monotonically
> increasing then you will be sending all of your requests to one region
> server. Even after the region splits, all new entries will keep punishing
> one server (the region responsible for the split containing the new keys).
>
> See these articles that may help if this is indeed your issue:
> 1. http://hbase.apache.org/book/rowkey.design.html
> 2.
> http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
>
> Regards,
> Steven Magana-Zook
>
>
>
>
>
>
> On 9/5/14 8:54 AM, "Jianshi Huang"  wrote:
>
> >Hi JM,
> >
> >What do you mean by the 'destination cluster'? The files are in the same
> >Hadoop/HDFS cluster where HBase is running.
> >
> >Do you mean do the bulk importing on HBase Master node?
> >
> >
> >Jianshi
> >
> >
> >On Fri, Sep 5, 2014 at 11:18 PM, Jean-Marc Spaggiari <
> >jean-m...@spaggiari.org> wrote:
> >
> >> Hi Jianshi,
> >>
> >> You might want to upload the file on the destination cluster first and
> >>then
> >> re-run your bulk load from there. That way the transfer time will not be
> >> taken into consideration for the timeout since the files will be local.
> >>
> >> JM
> >>
> >>
> >> 2014-09-05 11:15 GMT-04:00 Jianshi Huang :
> >>
> >> > I'm importing 2TB of generated HFiles to HBase and I constantly get
> >>the
> >> > following errors:
> >> >
> >> > Caused by:
> >> > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.RegionTooBusyException):
> >> > org.apache.hadoop.hbase.RegionTooBusyException: failed to get a lock in
> >> > 60000 ms.
> >> > regionName=grapple_edges_v2,ff00,1409817320781.6d2955c780b39523de733f3565642d96.,
> >> > server=x.xxx.xxx,60020,1404854700728
> >> > at org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5851)
> >> > at org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5837)
> >> > at org.apache.hadoop.hbase.regionserver.HRegion.startBulkRegionOperation(HRegion.java:5795)
> >> > at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3543)
> >> > at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3525)
> >> > at org.apache.hadoop.hbase.regionserver.HRegionServer.bulkLoadHFile(HRegionServer.java:3277)
> >> > at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28863)
> >> > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >> > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >> > at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> >> > at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> >> > at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> >> > at java.lang.Thread.run(Thread.java:724)
> >> >
> >> > at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1498)
> >> > at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
> >> > at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
> >> > at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.bulkLoadHFile(ClientProtos.java:29276)
> >> > at org.apache.hadoop.hbase.protobuf.ProtobufUtil.bulkLoadHFile(ProtobufUtil.java:1548)
> >> > ... 11 more
> >> >
> >> >
> >> > What makes the region too busy? Is there a way to improve it?
> >> >
> >> > Does that also mean some part of my data are not correctly imported?
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > --
> >> > Jianshi Huang
> >> >
> >> > LinkedIn: jianshi
> >> > Twitter: @jshuang
> >> > Github & Blog: http://huangjs.github.com/
> >> >
> >>
> >
> >
> >
> >--
> >Jianshi Huang
> >
> >LinkedIn: jianshi
> >Twitter: @jshuang
> >Github & Blog: http://huangjs.github.com/
>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
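
A minimal sketch of the salting approach Steven describes (option 3 in his
earlier reply): prepend a bounded, key-derived prefix so consecutive keys
spread across regions. The bucket count is an illustrative choice, and reads
must then fan out over all buckets:

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeys {
  private static final int BUCKETS = 16; // illustrative; size per cluster

  // The salt is derived from the key itself, so the same key always lands
  // in the same bucket while monotonic keys spread across BUCKETS regions.
  public static byte[] salt(byte[] originalKey) {
    int bucket = (Bytes.hashCode(originalKey) & Integer.MAX_VALUE) % BUCKETS;
    return Bytes.add(new byte[] { (byte) bucket }, originalKey);
  }
}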


Re: Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-05 Thread Jianshi Huang
Hi JM,

What do you mean by the 'destination cluster'? The files are in the same
Hadoop/HDFS cluster where HBase is running.

Do you mean do the bulk importing on HBase Master node?


Jianshi


On Fri, Sep 5, 2014 at 11:18 PM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> Hi Jianshi,
>
> You might want to upload the file on the destination cluster first and then
> re-run your bulk load from there. That way the transfer time will not be
> taken into consideration for the timeout since the files will be local.
>
> JM
>
>
> 2014-09-05 11:15 GMT-04:00 Jianshi Huang :
>
> > I'm importing 2TB of generated HFiles to HBase and I constantly get the
> > following errors:
> >
> > Caused by:
> >
> >
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.RegionTooBusyException):
> > org.apache.hadoop.hbase.RegionTooBusyException: failed to get a lock in
> > 60000 ms.
> >
> >
> regionName=grapple_edges_v2,ff00,1409817320781.6d2955c780b39523de733f3565642d96.,
> > server=x.xxx.xxx,60020,1404854700728
> > at
> > org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5851)
> > at
> > org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5837)
> > at
> >
> >
> org.apache.hadoop.hbase.regionserver.HRegion.startBulkRegionOperation(HRegion.java:5795)
> > at
> >
> >
> org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3543)
> > at
> >
> >
> org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3525)
> > at
> >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.bulkLoadHFile(HRegionServer.java:3277)
> > at
> >
> >
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28863)
> > at
> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> > at
> >
> >
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> > at
> >
> >
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> > at
> >
> >
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> > at java.lang.Thread.run(Thread.java:724)
> >
> > at
> org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1498)
> > at
> >
> >
> org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
> > at
> >
> >
> org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
> > at
> >
> >
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.bulkLoadHFile(ClientProtos.java:29276)
> > at
> >
> >
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.bulkLoadHFile(ProtobufUtil.java:1548)
> > ... 11 more
> >
> >
> > What makes the region too busy? Is there a way to improve it?
> >
> > Does that also mean some part of my data are not correctly imported?
> >
> >
> > Thanks,
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Help: RegionTooBusyException: failed to get a lock in 60000 ms

2014-09-05 Thread Jianshi Huang
I'm importing 2TB of generated HFiles to HBase and I constantly get the
following errors:

Caused by:
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.RegionTooBusyException):
org.apache.hadoop.hbase.RegionTooBusyException: failed to get a lock in
60000 ms.
regionName=grapple_edges_v2,ff00,1409817320781.6d2955c780b39523de733f3565642d96.,
server=x.xxx.xxx,60020,1404854700728
at
org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5851)
at
org.apache.hadoop.hbase.regionserver.HRegion.lock(HRegion.java:5837)
at
org.apache.hadoop.hbase.regionserver.HRegion.startBulkRegionOperation(HRegion.java:5795)
at
org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3543)
at
org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3525)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.bulkLoadHFile(HRegionServer.java:3277)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28863)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
at
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
at java.lang.Thread.run(Thread.java:724)

at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1498)
at
org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
at
org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.bulkLoadHFile(ClientProtos.java:29276)
at
org.apache.hadoop.hbase.protobuf.ProtobufUtil.bulkLoadHFile(ProtobufUtil.java:1548)
... 11 more


What makes the region too busy? Is there a way to improve it?

Does that also mean some part of my data are not correctly imported?


Thanks,

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
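
For reference, the bulk load can also be driven from code instead of the
completebulkload tool; a minimal sketch against the 0.98-era API (path and
table name are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    HTable table = new HTable(conf, "grapple_edges_v2"); // illustrative
    try {
      // Moves the HFiles under this directory into the table's regions,
      // splitting any file that spans a region boundary.
      loader.doBulkLoad(new Path("/tmp/hfiles"), table);
    } finally {
      table.close();
    }
  }
}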


Re: ResultScanner performance

2014-08-28 Thread Jianshi Huang
Ah, sure. That's a good idea. I know how to do it now. :)

Thanks for the help.

Jianshi


On Thu, Aug 28, 2014 at 12:29 PM, Ted Yu  wrote:

> You can enhance ColumnRangeFilter to return the first column in the range.
>
> In its filterKeyValue(Cell kv) method:
>
> int cmpMax = Bytes.compareTo(buffer, qualifierOffset, qualifierLength,
>                              this.maxColumn, 0, this.maxColumn.length);
>
> if (this.maxColumnInclusive && cmpMax <= 0 ||
>     !this.maxColumnInclusive && cmpMax < 0) {
>   return ReturnCode.INCLUDE;
> }
>
> ReturnCode.NEXT_ROW should be returned (for subsequent columns) once
> ReturnCode.INCLUDE is returned for the first column in range.
>
> Cheers
>
>
> On Wed, Aug 27, 2014 at 9:05 PM, Jianshi Huang 
> wrote:
>
> > Very similar. We setup a column range (we're using ColumnRangeFilter
> right
> > now), and we want the first column in the range.
> >
> > The problem is we have a lot of rows.
> >
> > If there's no such capability, then we need to control the parallelism
> > ourselves.
> >
> > Shall I sort the rows first before scanning? Will a random order be more
> > efficient if we have many servers?
> >
> > Jianshi
> >
> >
> > On Thu, Aug 28, 2014 at 1:44 AM, Ted Yu  wrote:
> >
> > > So you want to specify several columns. e.g. c2, c3, and c4, the GET is
> > > supposed to return the first one of them (doesn't have to be c2, can be
> > c3
> > > if c2 is absent) ?
> > >
> > > To my knowledge there is no such capability now.
> > >
> > > Cheers
> > >
> > >
> > > On Wed, Aug 27, 2014 at 10:28 AM, Jianshi Huang <
> jianshi.hu...@gmail.com
> > >
> > > wrote:
> > >
> > > > On Thu, Aug 28, 2014 at 1:20 AM, Jianshi Huang <
> > jianshi.hu...@gmail.com>
> > > > wrote:
> > > >
> > > > >
> > > > > There's a special but common case that for each row we only need
> the
> > > > first
> > > > > column. Is there a better way to do this than multiple scans +
> > take(1)?
> > > > >
> > > >
> > > > We still need to set a column range, is there a way to get the first
> > > column
> > > > value of a range using GET?
> > > >
> > > >
> > > > --
> > > > Jianshi Huang
> > > >
> > > > LinkedIn: jianshi
> > > > Twitter: @jshuang
> > > > Github & Blog: http://huangjs.github.com/
> > > >
> > >
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
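
A minimal sketch of the enhancement Ted describes, written as a hypothetical
subclass of ColumnRangeFilter (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;

// Emits only the first column within the range for each row, then skips
// straight to the next row.
public class FirstColumnInRangeFilter extends ColumnRangeFilter {
  private boolean found = false;

  public FirstColumnInRangeFilter(byte[] minColumn, boolean minInclusive,
                                  byte[] maxColumn, boolean maxInclusive) {
    super(minColumn, minInclusive, maxColumn, maxInclusive);
  }

  @Override
  public void reset() throws IOException {
    found = false; // new row: look for the first in-range column again
    super.reset();
  }

  @Override
  public ReturnCode filterKeyValue(Cell kv) {
    if (found) {
      return ReturnCode.NEXT_ROW; // first in-range column already emitted
    }
    ReturnCode rc = super.filterKeyValue(kv);
    if (rc == ReturnCode.INCLUDE) {
      found = true;
    }
    return rc;
  }
}

Note that a real custom filter also needs protobuf serialization
(toByteArray/parseFrom) and must be on the region servers' classpath before
it can be used from a client.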


Re: ResultScanner performance

2014-08-27 Thread Jianshi Huang
Very similar. We setup a column range (we're using ColumnRangeFilter right
now), and we want the first column in the range.

The problem is we have a lot of rows.

If there's no such capability, then we need to control the parallelism
ourselves.

Shall I sort the rows first before scanning? Will a random order be more
efficient if we have many servers?

Jianshi


On Thu, Aug 28, 2014 at 1:44 AM, Ted Yu  wrote:

> So you want to specify several columns. e.g. c2, c3, and c4, the GET is
> supposed to return the first one of them (doesn't have to be c2, can be c3
> if c2 is absent) ?
>
> To my knowledge there is no such capability now.
>
> Cheers
>
>
> On Wed, Aug 27, 2014 at 10:28 AM, Jianshi Huang 
> wrote:
>
> > On Thu, Aug 28, 2014 at 1:20 AM, Jianshi Huang 
> > wrote:
> >
> > >
> > > There's a special but common case that for each row we only need the
> > first
> > > column. Is there a better way to do this than multiple scans + take(1)?
> > >
> >
> > We still need to set a column range, is there a way to get the first
> column
> > value of a range using GET?
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: ResultScanner performance

2014-08-27 Thread Jianshi Huang
On Thu, Aug 28, 2014 at 1:20 AM, Jianshi Huang 
wrote:

>
> There's a special but common case that for each row we only need the first
> column. Is there a better way to do this than multiple scans + take(1)?
>

We still need to set a column range, is there a way to get the first column
value of a range using GET?


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: ResultScanner performance

2014-08-27 Thread Jianshi Huang
Hi,

The reason we cannot close the ResultScanner (or issue a multi-get) is
that we have wide rows with many columns, and we want to iterate over them
rather than get all the columns at once.

There's a special but common case that for each row we only need the first
column. Is there a better way to do this than multiple scans + take(1)?

Jianshi



On Wed, Aug 27, 2014 at 12:44 PM, Dai, Kevin  wrote:

> Hi, Ted
>
> I think you are right. But we must hold the ResultScanner for a while. So
> is there any way to reduce the performance loss? Or is there any way to
> share the connection?
>
> Best regards,
> Kevin.
>
> -Original Message-
> From: Ted Yu [mailto:yuzhih...@gmail.com]
> Sent: August 27, 2014 11:36
> To: user@hbase.apache.org
> Subject: Re: ResultScanner performance
>
> Keeping many ResultScanners open at the same time is not good for
> performance.
>
> Please see:
> http://hbase.apache.org/book.html#perf.hbase.client.scannerclose
>
> After fetching results from ResultScanner, you should close it ASAP.
>
> Cheers
>
>
> On Tue, Aug 26, 2014 at 8:18 PM, Dai, Kevin  wrote:
>
> > Hi, Ted
> >
> > We have a cluster of 48 machines and at least 100T of data (which is still
> > increasing).
> > The problem is that we have a lot of row keys (about tens of thousands)
> > to query at the same time and we don't fetch all the data at once,
> > instead we fetch them when needed, so we may hold tens of thousands of
> > ResultScanners at the same time.
> > I want to know whether it will hurt the performance and network
> > resources and if so, is there any way to solve it?
> >
> > Best regards,
> > Kevin.
> > -Original Message-
> > From: Ted Yu [mailto:yuzhih...@gmail.com]
> > Sent: August 26, 2014 16:49
> > To: user@hbase.apache.org
> > Cc: user@hbase.apache.org; Huang, Jianshi
> > Subject: Re: ResultScanner performance
> >
> > Can you give a bit more detail ?
> > What size is the cluster / dataset ?
> > What problem are you solving ?
> > Would using coprocessor help reduce the usage of ResultScanner ?
> >
> > Cheers
> >
> > On Aug 26, 2014, at 12:13 AM, "Dai, Kevin"  wrote:
> >
> > > Hi, everyone
> > >
> > > My application will hold tens of thousands of ResultScanners to get
> > > data. Will it hurt the performance and network resources?
> > > If so, is there any way to solve it?
> > > Thanks,
> > > Kevin.
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Hbase InputFormat for multi-row + column range, how to do it?

2014-08-20 Thread Jianshi Huang
I see and I'll try. Thanks Andrey!

Jianshi


On Wed, Aug 20, 2014 at 6:01 PM, Andrey Stepachev  wrote:

> Hi Jianshi.
>
> You can create your own. Just inherit from TableInputFormatBase or
> TableInputFormat and add ColumnRangeFilter to scan (either construct your
> own, or intercept setScan method).
>
> Hope this helps.
>
> --
> Andrey.
>
>
> On Wed, Aug 20, 2014 at 1:35 PM, Jianshi Huang 
> wrote:
>
> > Hi,
> >
> > I know TableInputFormat and HFileInputFormat can both set ROW_START and
> > ROW_END, but none of them can set the column range (like what we do in
> > ColumnRangeFilter).
> >
> > So how can I do column range in HBase InputFormat? Is there an
> > implementation available? If not, how much effort do you think it takes
> to
> > implement one?
> >
> > Best,
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>
>
>
> --
> Andrey.
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
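
A minimal sketch of Andrey's suggestion, intercepting setScan on
TableInputFormat to attach the column range (the class name and bounds are
illustrative; a real version would merge with any existing filter via
FilterList):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;

// Adds a column range on top of TableInputFormat's usual row range.
public class ColumnRangeTableInputFormat extends TableInputFormat {
  @Override
  public void setScan(Scan scan) {
    scan.setFilter(new ColumnRangeFilter(
        Bytes.toBytes("colA"), true,    // min column, inclusive
        Bytes.toBytes("colZ"), false)); // max column, exclusive
    super.setScan(scan);
  }
}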


Hbase InputFormat for multi-row + column range, how to do it?

2014-08-20 Thread Jianshi Huang
Hi,

I know TableInputFormat and HFileInputFormat can both set ROW_START and
ROW_END, but none of them can set the column range (like what we do in
ColumnRangeFilter).

So how can I do column range in HBase InputFormat? Is there an
implementation available? If not, how much effort do you think it takes to
implement one?

Best,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: How are split files distributed across Region servers?

2014-08-19 Thread Jianshi Huang
Ok, I found some reference. I was actually asking about the default load
balancer of HBase. And by googling, it seems it only makes the number of
regions even across region servers, but the distribution of regions is random.

Also found a good load balancer implementation, like this:


https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.html

Thanks for the help JM! :)

Jianshi


On Tue, Aug 19, 2014 at 2:31 PM, lars hofhansl  wrote:

> I'd change the max file size to 20GB. That'd give you 5000 regions for
> 100TB.
>
>
>
> ________
>  From: Jianshi Huang 
> To: user@hbase.apache.org
> Sent: Monday, August 18, 2014 12:22 PM
> Subject: Re: How are split files distributed across Region servers?
>
>
> Hi JM,
>
> By "make the range bigger" you mean make it multiple regions/splits, right?
>
> I probably will have >100TB of data, and I think the default split file
> size is 10GB. So I can assume each of my 100 machines will get assigned to
> 100 *random* regions?
>
> Where can I find the implementation details or settings for region
> assignment?
>
> Jianshi
>
>
>
> On Mon, Aug 18, 2014 at 8:48 PM, Jean-Marc Spaggiari <
> jean-m...@spaggiari.org> wrote:
>
> > Hi Jianshi,
> >
> > A region server can host more than one region. So if you pre-split your
> > table correctly based on your access usage, at the end all the servers
> > should be used evenly.
> >
> > If you have about 30% of your range which is not used, just make sure that
> > this range is bigger so at the end it will have the same load as the
> > others.
> >
> > JM
> >
> >
> > 2014-08-18 2:08 GMT-04:00 Jianshi Huang :
> >
> > > Hi JM,
> > >
> > > If the region boundaries will not change, does that mean,
> > >
> > > If my data access pattern has skews (say a certain part (30%) of my data
> > > will almost never be used), then a proportion (30%) of my servers will
> > > always be idle?
> > >
> > > A region server has to have a continuous rowkey range?
> > >
> > > Jianshi
> > >
> > >
> > >
> > >
> > > On Sat, Aug 16, 2014 at 2:46 AM, Jean-Marc Spaggiari <
> > > jean-m...@spaggiari.org> wrote:
> > >
> > > > Hi Jianshi,
> > > >
> > > > Not sure to get your question.
> > > >
> > > > Can I rephrase it?
> > > >
> > > > So you have 10 regions, and each of those regions has 10 HFiles. Then
> > you
> > > > run a major compaction on the table. Correct?
> > > >
> > > > Then you will end up with:
> > > >
> > > > reg1:[files:1]
> > > > reg2:[files:2]
> > > > reg3:[files:3]
> > > > ...
> > > >
> > > > Region boundaries will not change. But each region will now have a
> > > > single underlying file.
> > > >
> > > > HTH,
> > > >
> > > > JM
> > > >
> > > >
> > > > 2014-08-15 1:53 GMT-04:00 Jianshi Huang :
> > > >
> > > > > Say I have 100 split files on 10 region servers, and I did a major
> > > > compact.
> > > > >
> > > > > Will these split files be distributed like this:
> > > > > reg1: [splits 1,2,..,10]
> > > > > reg2: [splits 11,12,...,20]
> > > > > ...
> > > > >
> > > > > Or like this:
> > > > > reg1: [splits: 1, 11, 21, ... , 91]
> > > > > reg2: [splits: 2, 12, 22, ... , 92]
> > > > > ...
> > > > >
> > > > > And what if I want to specify the locality and the stride of split
> > > > > files? How can I do it in HBase?
> > > > >
> > > > >
> > > > > --
> > > > > Jianshi Huang
> > > > >
> > > > > LinkedIn: jianshi
> > > > > Twitter: @jshuang
> > > > > Github & Blog: http://huangjs.github.com/
>
>
>
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Jianshi Huang
> > >
> > > LinkedIn: jianshi
> > > Twitter: @jshuang
> > > Github & Blog: http://huangjs.github.com/
> > >
> >
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: How are split files distributed across Region servers?

2014-08-18 Thread Jianshi Huang
Hi JM,

By "make the range bigger" you mean make it multiple regions/splits, right?

I probably will have >100TB of data, and I think the default split file
size is 10GB. So I can assume each of my 100 machines will get assigned to
100 *random* regions?

Where can I find the implementation details or settings for region
assignment?

Jianshi



On Mon, Aug 18, 2014 at 8:48 PM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> Hi Jianshi,
>
> A region server can host more than one region. So if you pre-split your
> table correctly based on your access usage, at the end all the servers
> should be used evenly.
>
> If you have about 30% of your range which is not used, just make sure that
> this range is bigger so at the end it will have the same load as the
> others.
>
> JM
>
>
> 2014-08-18 2:08 GMT-04:00 Jianshi Huang :
>
> > Hi JM,
> >
> > If the region boundaries will not change, does that mean,
> >
> > If my data access pattern has skews (say a certain part (30%) of my data
> > will almost never be used), then a proportion (30%) of my servers will
> > always be idle?
> >
> > A region server has to have a continuous rowkey range?
> >
> > Jianshi
> >
> >
> >
> >
> > On Sat, Aug 16, 2014 at 2:46 AM, Jean-Marc Spaggiari <
> > jean-m...@spaggiari.org> wrote:
> >
> > > Hi Jianshi,
> > >
> > > Not sure to get your question.
> > >
> > > Can I rephrase it?
> > >
> > > So you have 10 regions, and each of those regions has 10 HFiles. Then
> you
> > > run a major compaction on the table. Correct?
> > >
> > > Then you will end up with:
> > >
> > > reg1:[files:1]
> > > reg2:[files:2]
> > > reg3:[files:3]
> > > ...
> > >
> > > Region boundaries will not change. But each region will now have a
> > > single underlying file.
> > >
> > > HTH,
> > >
> > > JM
> > >
> > >
> > > 2014-08-15 1:53 GMT-04:00 Jianshi Huang :
> > >
> > > > Say I have 100 split files on 10 region servers, and I did a major
> > > compact.
> > > >
> > > > Will these split files be distributed like this:
> > > > reg1: [splits 1,2,..,10]
> > > > reg2: [splits 11,12,...,20]
> > > > ...
> > > >
> > > > Or like this:
> > > > reg1: [splits: 1, 11, 21, ... , 91]
> > > > reg2: [splits: 2, 12, 22, ... , 92]
> > > > ...
> > > >
> > > > And what if I want to specify the locality and the stride of split
> > > > files? How can I do it in HBase?
> > > >
> > > >
> > > > --
> > > > Jianshi Huang
> > > >
> > > > LinkedIn: jianshi
> > > > Twitter: @jshuang
> > > > Github & Blog: http://huangjs.github.com/
> > > >
> > >
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: How are split files distributed across Region servers?

2014-08-17 Thread Jianshi Huang
Hi JM,

If the region boundaries will not change, does that mean,

If my data access pattern has skews (say a certain part (30%) of my data
will almost never be used), then a proportion (30%) of my servers will
always be idle?

A region server has to have a continuous rowkey range?

Jianshi




On Sat, Aug 16, 2014 at 2:46 AM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> Hi Jianshi,
>
> Not sure to get your question.
>
> Can I rephrase it?
>
> So you have 10 regions, and each of those regions has 10 HFiles. Then you
> run a major compaction on the table. Correct?
>
> Then you will end up with:
>
> reg1:[files:1]
> reg2:[files:2]
> reg3:[files:3]
> ...
>
> Region boundaries will not change. But each region will now have a single
> underlying file.
>
> HTH,
>
> JM
>
>
> 2014-08-15 1:53 GMT-04:00 Jianshi Huang :
>
> > Say I have 100 split files on 10 region servers, and I did a major
> compact.
> >
> > Will these split files be distributed like this:
> > reg1: [splits 1,2,..,10]
> > reg2: [splits 11,12,...,20]
> > ...
> >
> > Or like this:
> > reg1: [splits: 1, 11, 21, ... , 91]
> > reg2: [splits: 2, 12, 22, ... , 92]
> > ...
> >
> > And what if I want to specify the locality and the stride of split files?
> > How can I do it in HBase?
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


How are split files distributed across Region servers?

2014-08-14 Thread Jianshi Huang
Say I have 100 split files on 10 region servers, and I did a major compact.

Will these split files be distributed like this:
reg1: [splits 1,2,..,10]
reg2: [splits 11,12,...,20]
...

Or like this:
reg1: [splits: 1, 11, 21, ... , 91]
reg2: [splits: 2, 12, 22, ... , 92]
...

And what if I want to specify the locality and the stride of split files? How
can I do it in HBase?


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: How to create a connection pool with specified pool size?

2014-08-11 Thread Jianshi Huang
I see. Thank you Ted for the help. :)

Jianshi


On Mon, Aug 11, 2014 at 9:57 PM, Ted Yu  wrote:

> If you use the following method:
>
> public static HConnection createConnection(org.apache.hadoop.conf.Configuration conf,
>                                            ExecutorService pool)
>
> HConnection:
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnection.html
>
> You can pass your own ExecutorService.
>
> See example in
> http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html?is-external=true
>
> Cheers
>
>
>
> On Mon, Aug 11, 2014 at 2:40 AM, Jianshi Huang 
> wrote:
>
> > I followed the manual and used HConnectionManager.createConnection to
> > create a connection pool.
> >
> > However, I couldn't find any reference on how to specify the pool size. It
> > should be in the second parameter pool, of type ExecutorService, right?
> > How can I do that?
> >
> > Cheers,
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
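
A minimal sketch of what Ted describes: bound the pool by handing
createConnection your own ExecutorService (the pool size of 16 is an
illustrative choice):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;

public class PooledConnection {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // The ExecutorService bounds the connection's worker threads.
    ExecutorService pool = Executors.newFixedThreadPool(16);
    HConnection connection = HConnectionManager.createConnection(conf, pool);
    try {
      HTableInterface table = connection.getTable("mytable"); // illustrative
      try {
        // ... gets/puts/scans against 'table' ...
      } finally {
        table.close();
      }
    } finally {
      connection.close();
      pool.shutdown();
    }
  }
}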


How to create a connection pool with specified pool size?

2014-08-11 Thread Jianshi Huang
I followed the manual and used HConnectionManager.createConnection to
create a connection pool.

However, I couldn't find any reference on how to specify the pool size. It
should be in the second parameter pool, of type ExecutorService, right? How
can I do that?

Cheers,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Best practice for writing to HFileOutputFormat(2) with multiple Column Families

2014-08-01 Thread Jianshi Huang
I know HBase will set the TotalOrderPartitioner in MR, but in Spark, I need
to sort the rows myself.

Jianshi



On Sat, Aug 2, 2014 at 12:24 AM, Arun Allamsetty 
wrote:

> Hi Jianshi,
>
> Do you mean that you want to sort the row keys? If yes, then you don't have
> to worry about it because HBase sorts the row keys on its own,
> lexicographically.
>
> Cheers,
> Arun
>
> Sent from a mobile device. Please don't mind the typos.
> On Jul 30, 2014 9:02 PM, "Jianshi Huang"  wrote:
>
> > I need to generate HFiles from a 2TB dataset and explode it into 4 Column
> > Families.
> >
> > The result dataset is likely to be 20TB or more. I'm currently using
> Spark
> > so I sorted the (rk, cf, cq) myself. It's huge and I'm considering how to
> > optimize it.
> >
> > My question is:
> > Should I sort and write each column family one by one, or should I put
> them
> > all together then do sort and write?
> >
> > Does my question make sense?
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Best practice for writing to HFileOutputFormat(2) with multiple Column Families

2014-07-30 Thread Jianshi Huang
I need to generate HFiles from a 2TB dataset and explode it into 4 Column Families.

The result dataset is likely to be 20TB or more. I'm currently using Spark
so I sorted the (rk, cf, cq) myself. It's huge and I'm considering how to
optimize it.

My question is:
Should I sort and write each column family one by one, or should I put them
all together then do sort and write?

Does my question make sense?

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
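
For the sort itself, HFiles must be written in (rowkey, family, qualifier)
byte order; a sketch of that comparison over a hypothetical key holder (the
authoritative ordering in HBase is KeyValue.COMPARATOR):

import java.util.Comparator;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical holder for the (rk, cf, cq) triple sorted before writing.
class CellKey {
  final byte[] row, family, qualifier;
  CellKey(byte[] row, byte[] family, byte[] qualifier) {
    this.row = row;
    this.family = family;
    this.qualifier = qualifier;
  }
}

class CellKeyComparator implements Comparator<CellKey> {
  @Override
  public int compare(CellKey a, CellKey b) {
    int c = Bytes.compareTo(a.row, b.row);            // row first
    if (c != 0) return c;
    c = Bytes.compareTo(a.family, b.family);          // then column family
    if (c != 0) return c;
    return Bytes.compareTo(a.qualifier, b.qualifier); // then qualifier
  }
}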


Re: Completebulkload with namespace option?

2014-07-30 Thread Jianshi Huang
Wow, thanks! :)


On Thu, Jul 31, 2014 at 10:07 AM, Ted Yu  wrote:

> Matteo acted very fast - this has been fixed by HBASE-11609
>
> Cheers
>
>
> On Wed, Jul 30, 2014 at 7:02 PM, Jianshi Huang 
> wrote:
>
> > Created a Jira issue.
> >
> > https://issues.apache.org/jira/browse/HBASE-11622
> >
> >
> > On Tue, Jul 29, 2014 at 11:46 PM, Bharath Vissapragada <
> > bhara...@cloudera.com> wrote:
> >
> > > Appears to be a bug. It should be TableName.valueOf(...) or something
> > > similar. Mind filing a jira?
> > >
> > >
> > > On Tue, Jul 29, 2014 at 12:22 PM, Jianshi Huang <
> jianshi.hu...@gmail.com
> > >
> > > wrote:
> > >
> > > > I see why, looking at the source code of LoadIncrementalHFiles.java,
> it
> > > > seems the temporary path created for splitting will contain ':',
> > > >
> > > > The error part should be this:
> > > > String uniqueName = getUniqueName(table.getName());
> > > > HColumnDescriptor familyDesc =
> > > > table.getTableDescriptor().getFamily(item.family);
> > > > Path botOut = new Path(tmpDir, uniqueName + ".bottom");
> > > > Path topOut = new Path(tmpDir, uniqueName + ".top");
> > > > splitStoreFile(getConf(), hfilePath, familyDesc, splitKey,
> > > > botOut, topOut);
> > > >
> > > > uniqueName will be "namespace:table" so new Path will fail.
> > > >
> > > > A bug right?
> > > >
> > > > Jianshi
> > > >
> > > >
> > > > On Tue, Jul 29, 2014 at 2:42 PM, Jianshi Huang <
> > jianshi.hu...@gmail.com>
> > > > wrote:
> > > >
> > > > > I'm using hbase 0.98 with HDP 2.1.
> > > > >
> > > > >
> > > > > On Tue, Jul 29, 2014 at 2:39 PM, Jianshi Huang <
> > > jianshi.hu...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> I'm using completebulkload to load 500GB of data to a table
> > > > >> (presplitted). However, it reports the following errors:
> > > > >>
> > > > >> Looks like completebulkload didn't recognize the namespace part
> > > > >> (namespace:table).
> > > > >>
> > > > >> Is there an option to do it? I can't find one in Google...
> > > > >>
> > > > >> Exception in thread "main" 14/07/28 23:32:19 INFO
> > > > >> mapreduce.LoadIncrementalHFiles: Trying to load
> > > > >> hfile=hdfs://xxx/vertices/PROP/f5cbf0965ff44cb8bdabd038e66485c3
> > > > >> first=dc595cfe#cust#1812199228741466242
> > > > >>  last=dc68cedc#cust#2251647837553603393
> > > > >> java.lang.reflect.InvocationTargetException
> > > > >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> > Method)
> > > > >> at
> > > > >>
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > > >> at
> > > > >>
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > >> at java.lang.reflect.Method.invoke(Method.java:606)
> > > > >> at
> > > org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:54)
> > > > >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> > Method)
> > > > >> at
> > > > >>
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > > >> at
> > > > >>
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > >> at java.lang.reflect.Method.invoke(Method.java:606)
> > > > >> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> > > > >> Caused by: java.lang.IllegalStateException:
> > > > >> java.lang.IllegalArgumentException: java.net.URISyntaxException:
> > > > Relative
> > > > >> path in absolute URI: grapple:vertices,37.bottom
> > > > >> at
> > > > >>
> > > 

Re: Completebulkload with namespace option?

2014-07-30 Thread Jianshi Huang
Created a Jira issue.

https://issues.apache.org/jira/browse/HBASE-11622


On Tue, Jul 29, 2014 at 11:46 PM, Bharath Vissapragada <
bhara...@cloudera.com> wrote:

> Appears to be a bug. It should be TableName.valueOf(...) or something
> similar. Mind filing a jira?
>
>
> On Tue, Jul 29, 2014 at 12:22 PM, Jianshi Huang 
> wrote:
>
> > I see why, looking at the source code of LoadIncrementalHFiles.java, it
> > seems the temporary path created for splitting will contain ':',
> >
> > The error part should be this:
> > String uniqueName = getUniqueName(table.getName());
> > HColumnDescriptor familyDesc =
> > table.getTableDescriptor().getFamily(item.family);
> > Path botOut = new Path(tmpDir, uniqueName + ".bottom");
> > Path topOut = new Path(tmpDir, uniqueName + ".top");
> > splitStoreFile(getConf(), hfilePath, familyDesc, splitKey,
> > botOut, topOut);
> >
> > uniqueName will be "namespace:table" so new Path will fail.
> >
> > A bug right?
> >
> > Jianshi
> >
> >
> > On Tue, Jul 29, 2014 at 2:42 PM, Jianshi Huang 
> > wrote:
> >
> > > I'm using hbase 0.98 with HDP 2.1.
> > >
> > >
> > > On Tue, Jul 29, 2014 at 2:39 PM, Jianshi Huang <
> jianshi.hu...@gmail.com>
> > > wrote:
> > >
> > >> I'm using completebulkload to load 500GB of data to a table
> > >> (presplitted). However, it reports the following errors:
> > >>
> > >> Looks like completebulkload didn't recognize the namespace part
> > >> (namespace:table).
> > >>
> > >> Is there an option to do it? I can't find one in Google...
> > >>
> > >> Exception in thread "main" 14/07/28 23:32:19 INFO
> > >> mapreduce.LoadIncrementalHFiles: Trying to load
> > >> hfile=hdfs://xxx/vertices/PROP/f5cbf0965ff44cb8bdabd038e66485c3
> > >> first=dc595cfe#cust#1812199228741466242
> > >>  last=dc68cedc#cust#2251647837553603393
> > >> java.lang.reflect.InvocationTargetException
> > >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >> at
> > >>
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > >> at
> > >>
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >> at java.lang.reflect.Method.invoke(Method.java:606)
> > >> at
> org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:54)
> > >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >> at
> > >>
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > >> at
> > >>
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >> at java.lang.reflect.Method.invoke(Method.java:606)
> > >> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> > >> Caused by: java.lang.IllegalStateException:
> > >> java.lang.IllegalArgumentException: java.net.URISyntaxException:
> > Relative
> > >> path in absolute URI: grapple:vertices,37.bottom
> > >> at
> > >>
> >
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplitPhase(LoadIncrementalHFiles.java:421)
> > >> at
> > >>
> >
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:291)
> > >> at
> > >>
> >
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:825)
> > >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> > >> at
> > >>
> >
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:831)
> > >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >> at
> > >>
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > >> at
> > >>
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >> at java.lang.reflect.Method.invoke(Method.java:606)

Re: Completebulkload with namespace option?

2014-07-28 Thread Jianshi Huang
I see why, looking at the source code of LoadIncrementalHFiles.java, it
seems the temporary path created for splitting will contain ':',

The error part should be this:
String uniqueName = getUniqueName(table.getName());
HColumnDescriptor familyDesc =
table.getTableDescriptor().getFamily(item.family);
Path botOut = new Path(tmpDir, uniqueName + ".bottom");
Path topOut = new Path(tmpDir, uniqueName + ".top");
splitStoreFile(getConf(), hfilePath, familyDesc, splitKey,
botOut, topOut);

uniqueName will be "namespace:table" so new Path will fail.

A bug right?

Jianshi


On Tue, Jul 29, 2014 at 2:42 PM, Jianshi Huang 
wrote:

> I'm using hbase 0.98 with HDP 2.1.
>
>
> On Tue, Jul 29, 2014 at 2:39 PM, Jianshi Huang 
> wrote:
>
>> I'm using completebulkload to load 500GB of data to a table
>> (presplitted). However, it reports the following errors:
>>
>> Looks like completebulkload didn't recognize the namespace part
>> (namespace:table).
>>
>> Is there an option to do it? I can't find one in Google...
>>
>> Exception in thread "main" 14/07/28 23:32:19 INFO
>> mapreduce.LoadIncrementalHFiles: Trying to load
>> hfile=hdfs://xxx/vertices/PROP/f5cbf0965ff44cb8bdabd038e66485c3
>> first=dc595cfe#cust#1812199228741466242
>>  last=dc68cedc#cust#2251647837553603393
>> java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:54)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>> Caused by: java.lang.IllegalStateException:
>> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative
>> path in absolute URI: grapple:vertices,37.bottom
>> at
>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplitPhase(LoadIncrementalHFiles.java:421)
>> at
>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:291)
>> at
>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:825)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>> at
>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:831)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>> at
>> org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
>> at
>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153)
>> ... 10 more
>> Caused by: java.lang.IllegalArgumentException:
>> java.net.URISyntaxException: Relative path in absolute URI:
>> grapple:vertices,37.bottom
>> at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>> at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>> at org.apache.hadoop.fs.Path.<init>(Path.java:94)
>> at
>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.splitStoreFile(LoadIncrementalHFiles.java:450)
>> at
>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplit(LoadIncrementalHFiles.java:516)
>> at
>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:400)
>> at
>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:398)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
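
The failure mode is easy to reproduce in isolation, assuming only Hadoop's
Path class: the colon in the first segment of the child path is parsed as a
URI scheme.

import org.apache.hadoop.fs.Path;

public class ColonInPath {
  public static void main(String[] args) {
    // "grapple" is taken as a URI scheme, leaving a relative path, so this
    // throws IllegalArgumentException: "Relative path in absolute URI".
    Path p = new Path("/tmp/bulkload", "grapple:vertices,37.bottom");
  }
}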

Re: Completebulkload with namespace option?

2014-07-28 Thread Jianshi Huang
I'm using hbase 0.98 with HDP 2.1.


On Tue, Jul 29, 2014 at 2:39 PM, Jianshi Huang 
wrote:

> I'm using completebulkload to load 500GB of data to a table (presplitted).
> However, it reports the following errors:
>
> Looks like completebulkload didn't recognize the namespace part
> (namespace:table).
>
> Is there an option to do it? I can't find one in Google...
>
> Exception in thread "main" 14/07/28 23:32:19 INFO
> mapreduce.LoadIncrementalHFiles: Trying to load
> hfile=hdfs://xxx/vertices/PROP/f5cbf0965ff44cb8bdabd038e66485c3
> first=dc595cfe#cust#1812199228741466242
>  last=dc68cedc#cust#2251647837553603393
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Caused by: java.lang.IllegalStateException:
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative
> path in absolute URI: grapple:vertices,37.bottom
> at
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplitPhase(LoadIncrementalHFiles.java:421)
> at
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:291)
> at
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:825)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:831)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
> at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153)
> ... 10 more
> Caused by: java.lang.IllegalArgumentException:
> java.net.URISyntaxException: Relative path in absolute URI:
> grapple:vertices,37.bottom
> at org.apache.hadoop.fs.Path.initialize(Path.java:206)
> at org.apache.hadoop.fs.Path.<init>(Path.java:172)
> at org.apache.hadoop.fs.Path.<init>(Path.java:94)
> at
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.splitStoreFile(LoadIncrementalHFiles.java:450)
> at
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplit(LoadIncrementalHFiles.java:516)
> at
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:400)
> at
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:398)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
> Caused by: java.net.URISyntaxException: Relative path in absolute URI:
> grapple:vertices,37.bottom
> at java.net.URI.checkPath(URI.java:1804)
> at java.net.URI.<init>(URI.java:752)
> at org.apache.hadoop.fs.Path.initialize(Path.java:203)
> ... 10 more
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Completebulkload with namespace option?

2014-07-28 Thread Jianshi Huang
I'm using completebulkload to load 500GB of data to a table (presplitted).
However, it reports the following errors:

Looks like completebulkload didn't recognize the namespace part
(namespace:table).

Is there an option to do it? I can't find one in Google...

Exception in thread "main" 14/07/28 23:32:19 INFO
mapreduce.LoadIncrementalHFiles: Trying to load
hfile=hdfs://xxx/vertices/PROP/f5cbf0965ff44cb8bdabd038e66485c3
first=dc595cfe#cust#1812199228741466242
 last=dc68cedc#cust#2251647837553603393
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.IllegalStateException:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative
path in absolute URI: grapple:vertices,37.bottom
at
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplitPhase(LoadIncrementalHFiles.java:421)
at
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:291)
at
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:825)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:831)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
at
org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153)
... 10 more
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException:
Relative path in absolute URI: grapple:vertices,37.bottom
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.splitStoreFile(LoadIncrementalHFiles.java:450)
at
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.groupOrSplit(LoadIncrementalHFiles.java:516)
at
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:400)
at
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$2.call(LoadIncrementalHFiles.java:398)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.net.URISyntaxException: Relative path in absolute URI:
grapple:vertices,37.bottom
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
    ... 10 more

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Scan columns of a row within a Range

2014-07-17 Thread Jianshi Huang
Yes, I found the info from a nice blog article. Thanks Ted!

Jianshi


On Thu, Jul 17, 2014 at 10:07 PM, Ted Yu  wrote:

> ColumnRangeFilter implements getNextCellHint() to facilitate jumping to
> the minColumn.
> When current column is past maxColumn, it skips to next row.
>
> So ColumnRangeFilter is very effective.
>
> Cheers
>
>
> On Thu, Jul 17, 2014 at 12:45 AM, Jianshi Huang 
> wrote:
>
> > Hi Esteban,
> >
> > Yes, I found it moments ago. Is it as efficient as the Row scan?
> >
> > And can I have millions of columns in a row with no or little performance
> > impact? (the traditional tall vs wide problem; the hbase manual
> > recommends tall tables over wide tables.)
> >
> >
> > Jianshi
> >
> >
> > On Thu, Jul 17, 2014 at 3:01 PM, Esteban Gutierrez  >
> > wrote:
> >
> > > Hi Jianshi,
> > >
> > > Have you looked into the ColumnRangeFilter?
> > >
> > >
> >
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html
> > >
> > > cheers,
> > > esteban.
> > >
> > >
> > > --
> > > Cloudera, Inc.
> > >
> > >
> > >
> > > On Wed, Jul 16, 2014 at 11:40 PM, Jianshi Huang <
> jianshi.hu...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I scanned through HBase's Scan API and couldn't find out how to scan a
> > > > range of columns in a row.
> > > >
> > > > It seems I can only do scan(startRow, endRow), which are both just
> > > > RowKeys.
> > > >
> > > > What's the most efficient way to do it? Should I use a Filter? I heard
> > > > filters are not as efficient as RK scans; how much slower is it?
> > > >
> > > > (BTW, I was using Accumulo for the same thing and it has a really
> nice
> > > API
> > > > (Range, Key) for it. A Key is a combination of RK+CF+CQ+TS.)
> > > >
> > > > Am I missing anything?
> > > >
> > > > Cheers,
> > > > --
> > > > Jianshi Huang
> > > >
> > > > LinkedIn: jianshi
> > > > Twitter: @jshuang
> > > > Github & Blog: http://huangjs.github.com/
> > > >
> > >
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
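
A minimal sketch of the ColumnRangeFilter usage settled on in this thread
(table name and column bounds are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnRangeScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // illustrative
    Scan scan = new Scan(Bytes.toBytes("row000"), Bytes.toBytes("row999"));
    scan.setFilter(new ColumnRangeFilter(
        Bytes.toBytes("c2"), true,    // min column, inclusive
        Bytes.toBytes("c5"), false)); // max column, exclusive
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // each Result carries only the columns inside [c2, c5)
      }
    } finally {
      scanner.close(); // close ASAP, as the reference guide recommends
      table.close();
    }
  }
}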


Re: Scan columns of a row within a Range

2014-07-17 Thread Jianshi Huang
Hi Esteban,

Yes, I found it moments ago. Is it as efficient as the Row scan?

And can I have millions of columns in a row with no or little performance
impact? (the traditional tall vs wide problem; the hbase manual
recommends tall tables over wide tables.)


Jianshi


On Thu, Jul 17, 2014 at 3:01 PM, Esteban Gutierrez 
wrote:

> Hi Jianshi,
>
> Have you looked into the ColumnRangeFilter?
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html
>
> cheers,
> esteban.
>
>
> --
> Cloudera, Inc.
>
>
>
> On Wed, Jul 16, 2014 at 11:40 PM, Jianshi Huang 
> wrote:
>
> > Hi,
> >
> > I scanned through HBase's Scan API and couldn't find out how to scan a
> > range of columns in a row.
> >
> > It seems I can only do scan(startRow, endRow), which are both just
> > RowKeys.
> >
> > What's the most efficient way to do it? Should I use a Filter? I heard
> > filters are not as efficient as RK scans; how much slower is it?
> >
> > (BTW, I was using Accumulo for the same thing and it has a really nice
> API
> > (Range, Key) for it. A Key is a combination of RK+CF+CQ+TS.)
> >
> > Am I missing anything?
> >
> > Cheers,
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Scan columns of a row within a Range

2014-07-16 Thread Jianshi Huang
Hi,

I scanned through HBase's Scan API and couldn't find out how to scan a range
of columns in a row.

It seems I can only do scan(startRow, endRow), which are both just RowKeys.

What's the most efficient way to do it? Should I use a Filter? I heard
filters are not as efficient as RK scans; how much slower is it?

(BTW, I was using Accumulo for the same thing and it has a really nice API
(Range, Key) for it. A Key is a combination of RK+CF+CQ+TS.)

Am I missing anything?

Cheers,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/