Re: Why Hadoop can't find Reducer when Mapper reads data from HBase?

2012-07-12 Thread yonghu
Strange thing is that the same program works fine on the cluster. By the
way, the same error also happened in pseudo-distributed mode when MapReduce
read data from Cassandra in the map phase and transferred it to the reduce phase.

regards!

Yong

On Thu, Jul 12, 2012 at 2:01 PM, Stack  wrote:
> On Thu, Jul 12, 2012 at 1:15 PM, yonghu  wrote:
>> java.lang.RuntimeException: java.lang.ClassNotFoundException:
>> com.mapreducetablescan.MRTableAccess$MTableReducer;
>>
>> Does anybody know why?
>>
>
> It's not in your job jar?  Check the job jar (jar -tf JAR_FILE).
>
> St.Ack
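
For completeness, a minimal job-setup sketch that avoids this failure (hedged: the driver, mapper, reducer, and table names below are placeholders reconstructed from the exception message, not the original code). The key point is job.setJarByClass(...) plus the TableMapReduceUtil helper, so the jar that contains the reducer is shipped to the tasks:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MRTableAccess {

  // Mapper reads rows from HBase and emits (rowkey, 1).
  static class MTableMapper extends TableMapper<Text, IntWritable> {
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(row.get()), new IntWritable(1));
    }
  }

  // The reducer must be packaged in the same job jar, otherwise the tasks
  // throw ClassNotFoundException for MRTableAccess$MTableReducer.
  static class MTableReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : vals) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-scan-with-reducer");
    job.setJarByClass(MRTableAccess.class);   // ships the jar containing both classes

    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);               // don't pollute the block cache from MR

    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        MTableMapper.class, Text.class, IntWritable.class, job);
    job.setReducerClass(MTableReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}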


Re: Embedded table data model

2012-07-12 Thread Ian Varley
Yes, that's what I mean.

It is not the only way to model this, but your question was, "Can we embed
the transactions inside the customer table in HBase".



On Jul 12, 2012, at 8:21 PM, "Xiaobo Gu" <guxiaobo1...@gmail.com> wrote:

Hi Ian,

Do you mean each transaction will be created as a column inside the cf
for transactions, and these columns are created dynamically as
transactions occur?

Regards,

Xiaobo Gu

On Fri, Jul 13, 2012 at 11:08 AM, Ian Varley <ivar...@salesforce.com> wrote:
Column families are not the same thing as columns. You should indeed have a 
small number of column families, as that article points out. Columns (aka 
column qualifiers) are run-time defined key/value pairs that contain the data 
for every row, and having large numbers of these is fine.



On Jul 12, 2012, at 7:27 PM, "Cole" <heshua...@gmail.com> wrote:

I think this design has some issues; please refer to
http://hbase.apache.org/book/number.of.cfs.html

2012/7/12 Ian Varley <ivar...@salesforce.com>

Yes, that's fine; you can always do a single column PUT into an existing
row, in a concurrency-safe way, and the lock on the row is only held as
long as it takes to do that. Because of HBase's Log-Structured Merge-Tree
architecture, that's efficient because the PUT only goes to memory, and is
merged with on-disk records at read time (until a regular flush or
compaction happens).

So even though you already have, say, 10K transactions in the table, it's
still efficient to PUT a single new transaction in (whether that's in the
middle of the sorted list of columns, at the end, etc.)

Ian

On Jul 11, 2012, at 11:27 PM, Xiaobo Gu wrote:

but there are other writers inserting new transactions into the table when
customers do new transactions.

On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley <ivar...@salesforce.com> wrote:
Hi Xiaobo -

For HBase, this is doable; you could have a single table in HBase where
each row is a customer (with the customerid as the rowkey), and columns for
each of the 300 attributes that are directly part of the customer entity.
This is sparse, so you'd only take up space for the attributes that
actually exist for each customer.

You could then have (possibly in another column family, but not
necessarily) an additional column for each transaction, where the column
name is composed of a date concatenated with the transaction id, in which
you store the 30 attributes as serialized into a single byte array in the
cell value. (Or, you could alternately do each attribute as its own column
but there's no advantage to doing so, since presumably a transaction is
roughly like an immutable event that you wouldn't typically change just a
single attribute of.) A schema for this (if spelled out in an xml
representation) could be:

<table name="customer">
  <column-family name="info">
    <column name="customer_name"/>
    <column name="customer_address"/>
    <!-- ... one column per customer-level attribute ... -->
  </column-family>
  <column-family name="txn">
    <column name="2012-07-11 12:34:56_TXN12345"
            value="[the 30 transaction attributes serialized as one byte array]"/>
    <column name="2012-07-11 13:02:10_TXN12399"
            value="[...]"/>
    <!-- ... one column per transaction, named date_transactionid ... -->
  </column-family>
</table>

(This isn't real HBase syntax, it's just an abstract way to show you the
structure.) In practice, HBase isn't doing anything "special" with the
entity that lives nested inside your table; it's just a matter of
convention, that you could "see" it that way. The customer-level attributes
(like, say, "customer_name" and "customer_address") would be literal column
names (aka column qualifiers) embedded in your code, whereas the
transaction-oriented columns would be created at runtime with column names
like "2012-07-11 12:34:56_TXN12345", and values that are simply collection
objects (containing the 30 attributes) serialized into a byte array.
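
As a concrete illustration, here is a minimal client-side sketch of that write path (hedged: the table name, the "txn" family, the qualifier format, and the empty byte[] standing in for the serialized attributes are illustrative, not from the thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "customer");

// Row key = customer id; column qualifier = date + transaction id,
// created dynamically as transactions occur.
byte[] family = Bytes.toBytes("txn");
byte[] qualifier = Bytes.toBytes("2012-07-11 12:34:56_TXN12345");
byte[] serializedTxn = new byte[0];   // stand-in for the ~30 attributes serialized together

Put put = new Put(Bytes.toBytes("CUST0000001"));
put.add(family, qualifier, serializedTxn);
table.put(put);    // only this row is briefly locked; the write goes to the memstore
table.close();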

In this scenario, you get fast access to any customer by ID, and further
to a range of transactions by date (using, say, a column pagination
filter). This would perform roughly equivalently regardless of how many
customers are in the table, or how many transactions exist for each
customer. What you'd lose on this design would be the ability to get a
single transaction for a single customer by ID (since you're storing them
by date). But if you need that, you could actually store it both ways. You
also might be introducing some extra contention on concurrent transaction
PUT requests for a single client, because they'd have to fight over a lock
for the row (but that's probably not a big deal, since it's only
contentious within each customer).
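
And the corresponding read side, sketched with ColumnRangeFilter to bound the transaction qualifiers by date (the message above mentions a column pagination filter; ColumnPaginationFilter would page through columns instead - both assume an HBase version that ships these filters, e.g. 0.92+; names are again illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "customer");

// One customer row; fetch only the transaction columns for July 2012.
Get get = new Get(Bytes.toBytes("CUST0000001"));
get.addFamily(Bytes.toBytes("txn"));
get.setFilter(new ColumnRangeFilter(
    Bytes.toBytes("2012-07-01"), true,      // min qualifier, inclusive
    Bytes.toBytes("2012-08-01"), false));   // max qualifier, exclusive

Result result = table.get(get);
for (KeyValue kv : result.raw()) {
  byte[] dateAndTxnId = kv.getQualifier();  // "<date>_<txnid>"
  byte[] serializedTxn = kv.getValue();     // deserialize the 30 attributes here
}
table.close();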

You might find my presentation on designing HBase schemas (from this
year's HBaseCon) useful:

http://www.hbasecon.com/sessions/hbase-schema-design-2/

Ian

On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote:

Hi,

I have a technical problem, and wonder whether HBase or Cassandra
support an embedded table data model, or can somebody show me a way to do
this:

1. We have a very large customer entity table which has 100 million
rows; each customer row has about 300 attributes (columns).
2. Each customer does about 1000 transactions per year, each transaction
has about 30 attributes (columns), and we only keep one year of
transactions for each customer.

We want a data model with which we can get the customer entity with

Re: Embedded table data model

2012-07-12 Thread Xiaobo Gu
Hi Ian,

Do you mean each transaction will be created as a column inside the cf
for transactions, and these columns are created dynamically as
transactions occur?

Regards,

Xiaobo Gu

On Fri, Jul 13, 2012 at 11:08 AM, Ian Varley  wrote:
> Column families are not the same thing as columns. You should indeed have a 
> small number of column families, as that article points out. Columns (aka 
> column qualifiers) are run-time defined key/value pairs that contain the data 
> for every row, and having large numbers of these is fine.
>
>
>
> On Jul 12, 2012, at 7:27 PM, "Cole"  wrote:
>
>> I think this design has some issues; please refer to
>> http://hbase.apache.org/book/number.of.cfs.html
>>
>> 2012/7/12 Ian Varley 
>>
>>> Yes, that's fine; you can always do a single column PUT into an existing
>>> row, in a concurrency-safe way, and the lock on the row is only held as
>>> long as it takes to do that. Because of HBase's Log-Structured Merge-Tree
>>> architecture, that's efficient because the PUT only goes to memory, and is
>>> merged with on-disk records at read time (until a regular flush or
>>> compaction happens).
>>>
>>> So even though you already have, say, 10K transactions in the table, it's
>>> still efficient to PUT a single new transaction in (whether that's in the
>>> middle of the sorted list of columns, at the end, etc.)
>>>
>>> Ian
>>>
>>> On Jul 11, 2012, at 11:27 PM, Xiaobo Gu wrote:
>>>
>>> but there are other writers inserting new transactions into the table when
>>> customers do new transactions.
>>>
>>> On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley wrote:
>>> Hi Xiaobo -
>>>
>>> For HBase, this is doable; you could have a single table in HBase where
>>> each row is a customer (with the customerid as the rowkey), and columns for
>>> each of the 300 attributes that are directly part of the customer entity.
>>> This is sparse, so you'd only take up space for the attributes that
>>> actually exist for each customer.
>>>
>>> You could then have (possibly in another column family, but not
>>> necessarily) an additional column for each transaction, where the column
>>> name is composed of a date concatenated with the transaction id, in which
>>> you store the 30 attributes as serialized into a single byte array in the
>>> cell value. (Or, you could alternately do each attribute as its own column
>>> but there's no advantage to doing so, since presumably a transaction is
>>> roughly like an immutable event that you wouldn't typically change just a
>>> single attribute of.) A schema for this (if spelled out in an xml
>>> representation) could be:
>>>
>>> [abstract XML schema sketch]
>>>
>>> (This isn't real HBase syntax, it's just an abstract way to show you the
>>> structure.) In practice, HBase isn't doing anything "special" with the
>>> entity that lives nested inside your table; it's just a matter of
>>> convention, that you could "see" it that way. The customer-level attributes
>>> (like, say, "customer_name" and "customer_address") would be literal column
>>> names (aka column qualifiers) embedded in your code, whereas the
>>> transaction-oriented columns would be created at runtime with column names
>>> like "2012-07-11 12:34:56_TXN12345", and values that are simply collection
>>> objects (containing the 30 attributes) serialized into a byte array.
>>>
>>> In this scenario, you get fast access to any customer by ID, and further
>>> to a range of transactions by date (using, say, a column pagination
>>> filter). This would perform roughly equivalently regardless of how many
>>> customers are in the table, or how many transactions exist for each
>>> customer. What you'd lose on this design would be the ability to get a
>>> single transaction for a single customer by ID (since you're storing them
>>> by date). But if you need that, you could actually store it both ways. You
>>> also might be introducing some extra contention on concurrent transaction
>>> PUT requests for a single client, because they'd have to fight over a lock
>>> for the row (but that's probably not a big deal, since it's only
>>> contentious within each customer).
>>>
>>> You might find my presentation on designing HBase schemas (from this
>>> year's HBaseCon) useful:
>>>
>>> http://www.hbasecon.com/sessions/hbase-schema-design-2/
>>>
>>> Ian
>>>
>>> On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote:
>>>
>>> Hi,
>>>
>>> I have a technical problem, and wonder whether HBase or Cassandra
>>> support an embedded table data model, or can somebody show me a way to do
>>> this:
>>>
>>> 1. We have a very large customer entity table which has 100 million
>>> rows; each customer row has about 300 attributes (columns).
>>> 2. Each customer does about 1000 transactions per year, each transaction
>>> has about 30 attributes (columns), and we only keep one year of
>>> transactions for each 

Re: Embedded table data model

2012-07-12 Thread Ian Varley
Column families are not the same thing as columns. You should indeed have a 
small number of column families, as that article points out. Columns (aka 
column qualifiers) are run-time defined key/value pairs that contain the data 
for every row, and having large numbers of these is fine. 



On Jul 12, 2012, at 7:27 PM, "Cole"  wrote:

> I think this design has some issues; please refer to
> http://hbase.apache.org/book/number.of.cfs.html
> 
> 2012/7/12 Ian Varley 
> 
>> Yes, that's fine; you can always do a single column PUT into an existing
>> row, in a concurrency-safe way, and the lock on the row is only held as
>> long as it takes to do that. Because of HBase's Log-Structured Merge-Tree
>> architecture, that's efficient because the PUT only goes to memory, and is
>> merged with on-disk records at read time (until a regular flush or
>> compaction happens).
>> 
>> So even though you already have, say, 10K transactions in the table, it's
>> still efficient to PUT a single new transaction in (whether that's in the
>> middle of the sorted list of columns, at the end, etc.)
>> 
>> Ian
>> 
>> On Jul 11, 2012, at 11:27 PM, Xiaobo Gu wrote:
>> 
>> but there are other writers inserting new transactions into the table when
>> customers do new transactions.
>> 
>> On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley wrote:
>> 
>> For HBase, this is doable; you could have a single table in HBase where
>> each row is a customer (with the customerid as the rowkey), and columns for
>> each of the 300 attributes that are directly part of the customer entity.
>> This is sparse, so you'd only take up space for the attributes that
>> actually exist for each customer.
>> 
>> You could then have (possibly in another column family, but not
>> necessarily) an additional column for each transaction, where the column
>> name is composed of a date concatenated with the transaction id, in which
>> you store the 30 attributes as serialized into a single byte array in the
>> cell value. (Or, you could alternately do each attribute as its own column
>> but there's no advantage to doing so, since presumably a transaction is
>> roughly like an immutable event that you wouldn't typically change just a
>> single attribute of.) A schema for this (if spelled out in an xml
>> representation) could be:
>> 
>> [abstract XML schema sketch]
>> 
>> (This isn't real HBase syntax, it's just an abstract way to show you the
>> structure.) In practice, HBase isn't doing anything "special" with the
>> entity that lives nested inside your table; it's just a matter of
>> convention, that you could "see" it that way. The customer-level attributes
>> (like, say, "customer_name" and "customer_address") would be literal column
>> names (aka column qualifiers) embedded in your code, whereas the
>> transaction-oriented columns would be created at runtime with column names
>> like "2012-07-11 12:34:56_TXN12345", and values that are simply collection
>> objects (containing the 30 attributes) serialized into a byte array.
>> 
>> In this scenario, you get fast access to any customer by ID, and further
>> to a range of transactions by date (using, say, a column pagination
>> filter). This would perform roughly equivalently regardless of how many
>> customers are in the table, or how many transactions exist for each
>> customer. What you'd lose on this design would be the ability to get a
>> single transaction for a single customer by ID (since you're storing them
>> by date). But if you need that, you could actually store it both ways. You
>> also might be introducing some extra contention on concurrent transaction
>> PUT requests for a single client, because they'd have to fight over a lock
>> for the row (but that's probably not a big deal, since it's only
>> contentious within each customer).
>> 
>> You might find my presentation on designing HBase schemas (from this
>> year's HBaseCon) useful:
>> 
>> http://www.hbasecon.com/sessions/hbase-schema-design-2/
>> 
>> Ian
>> 
>> On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote:
>> 
>> Hi,
>> 
>> I have a technical problem, and wonder whether HBase or Cassandra
>> support an embedded table data model, or can somebody show me a way to do
>> this:
>>
>> 1. We have a very large customer entity table which has 100 million
>> rows; each customer row has about 300 attributes (columns).
>> 2. Each customer does about 1000 transactions per year, each transaction
>> has about 30 attributes (columns), and we only keep one year of
>> transactions for each customer.
>>
>> We want a data model with which we can get the customer entity with all the
>> transactions he did, in a single client call, within a fixed time
>> window, according to the customer id (which is the primary key of the
>> customer table). We do the following in an RDBMS:
>> a customer table with customerid as the primary key, and a

Re: Embedded table data model

2012-07-12 Thread Cole
I think this design has some issues; please refer to
http://hbase.apache.org/book/number.of.cfs.html

2012/7/12 Ian Varley 

> Yes, that's fine; you can always do a single column PUT into an existing
> row, in a concurrency-safe way, and the lock on the row is only held as
> long as it takes to do that. Because of HBase's Log-Structured Merge-Tree
> architecture, that's efficient because the PUT only goes to memory, and is
> merged with on-disk records at read time (until a regular flush or
> compaction happens).
>
> So even though you already have, say, 10K transactions in the table, it's
> still efficient to PUT a single new transaction in (whether that's in the
> middle of the sorted list of columns, at the end, etc.)
>
> Ian
>
> On Jul 11, 2012, at 11:27 PM, Xiaobo Gu wrote:
>
> but there are other writers inserting new transactions into the table when
> customers do new transactions.
>
> On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley wrote:
> Hi Xiaobo -
>
> For HBase, this is doable; you could have a single table in HBase where
> each row is a customer (with the customerid as the rowkey), and columns for
> each of the 300 attributes that are directly part of the customer entity.
> This is sparse, so you'd only take up space for the attributes that
> actually exist for each customer.
>
> You could then have (possibly in another column family, but not
> necessarily) an additional column for each transaction, where the column
> name is composed of a date concatenated with the transaction id, in which
> you store the 30 attributes as serialized into a single byte array in the
> cell value. (Or, you could alternately do each attribute as its own column
> but there's no advantage to doing so, since presumably a transaction is
> roughly like an immutable event that you wouldn't typically change just a
> single attribute of.) A schema for this (if spelled out in an xml
> representation) could be:
>
> [abstract XML schema sketch]
>
> (This isn't real HBase syntax, it's just an abstract way to show you the
> structure.) In practice, HBase isn't doing anything "special" with the
> entity that lives nested inside your table; it's just a matter of
> convention, that you could "see" it that way. The customer-level attributes
> (like, say, "customer_name" and "customer_address") would be literal column
> names (aka column qualifiers) embedded in your code, whereas the
> transaction-oriented columns would be created at runtime with column names
> like "2012-07-11 12:34:56_TXN12345", and values that are simply collection
> objects (containing the 30 attributes) serialized into a byte array.
>
> In this scenario, you get fast access to any customer by ID, and further
> to a range of transactions by date (using, say, a column pagination
> filter). This would perform roughly equivalently regardless of how many
> customers are in the table, or how many transactions exist for each
> customer. What you'd lose on this design would be the ability to get a
> single transaction for a single customer by ID (since you're storing them
> by date). But if you need that, you could actually store it both ways. You
> also might be introducing some extra contention on concurrent transaction
> PUT requests for a single client, because they'd have to fight over a lock
> for the row (but that's probably not a big deal, since it's only
> contentious within each customer).
>
> You might find my presentation on designing HBase schemas (from this
> year's HBaseCon) useful:
>
> http://www.hbasecon.com/sessions/hbase-schema-design-2/
>
> Ian
>
> On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote:
>
> Hi,
>
> I have a technical problem, and wonder whether HBase or Cassandra
> support an embedded table data model, or can somebody show me a way to do
> this:
>
> 1. We have a very large customer entity table which has 100 million
> rows; each customer row has about 300 attributes (columns).
> 2. Each customer does about 1000 transactions per year, each transaction
> has about 30 attributes (columns), and we only keep one year of
> transactions for each customer.
>
> We want a data model with which we can get the customer entity with all the
> transactions he did, in a single client call, within a fixed time
> window, according to the customer id (which is the primary key of the
> customer table). We do the following in an RDBMS: a customer table with
> customerid as the primary key, and a transaction table with customer id
> as a secondary index, and join them, or we must do two separate calls;
> and because we have so many concurrent readers and these two tables have
> become so large, the RDBMS system performs poorly.
>
>
> Can we embed the transactions inside the customer table in HBase or
> Cassandra?
>
>
> Regards,
>
> Xiaobo Gu
>
>
>


Re: HDFS + HBASE process high cpu usage

2012-07-12 Thread deanforwever2010
Maybe there is some slow query. I met the same problem: I found out that when
I queried 100 thousand columns of a row, HBase had no response and stopped
working.

2012/7/13 Esteban Gutierrez 

> Hi Asaf,
>
> By any chance has this issue been going on in your boxes for the last
> few days? I won't be surprised by so many calls to futex by the JVM itself,
> but since you are describing the same symptoms as the leap second issue it
> would be good to know what OS you are using, if NTP is/was running or not,
> and if the boxes have been restarted or not after Jul 1. If the leap second
> issue is the cause of this, then just running date -s "`date`" as root will
> lower the cpu usage.
>
> regards,
> esteban.
>
>
> --
> Cloudera, Inc.
>
>
>
>
> On Thu, Jul 12, 2012 at 10:12 AM, Asaf Mesika 
> wrote:
>
> > Just adding more information.
> > The following is a histogram output of 'strace -p  -f -C' which
> > ran for 10 seconds. For some reason futex takes 65% of the time.
> >
> > % time     seconds  usecs/call     calls    errors syscall
> > -- --- --- - - 
> >  65.06   11.097387 103108084 53662 futex
> >  12.002.047692  17064112 3 restart_syscall
> >   8.731.488824   2326364   accept
> >   6.991.1921925624   212   poll
> >   6.601.125829   2251750   epoll_wait
> >   0.260.045039 50689   close
> >   0.190.031703 170   187   sendto
> >   0.040.007508 11068   setsockopt
> >   0.030.005558  27   209   recvfrom
> >   0.020.003000 375 8   sched_yield
> >   0.020.002999 10728 1 epoll_ctl
> >   0.010.002000 12516   open
> >   0.010.001999 16712   getsockname
> >   0.010.001156  3632   write
> >   0.010.001000 10010   fstat
> >   0.010.001000  3033   fcntl
> >   0.010.000999  1567   dup2
> >   0.000.000488  98 5   rt_sigreturn
> >   0.000.000350   84610 read
> >   0.000.000222   451   mprotect
> >   0.000.000167  42 4   openat
> >   0.000.92   252   stat
> >   0.000.84   245   statfs
> >   0.000.74   421   mmap
> >   0.000.00   0 9   munmap
> >   0.000.00   026   rt_sigprocmask
> >   0.000.00   0 3   ioctl
> >   0.000.00   0 1   pipe
> >   0.000.00   0 5   madvise
> >   0.000.00   0 6   socket
> >   0.000.00   0 6 4 connect
> >   0.000.00   0 1   shutdown
> >   0.000.00   0 3   getsockopt
> >   0.000.00   0 7   clone
> >   0.000.00   0 8   getdents
> >   0.000.00   0 3   getrlimit
> >   0.000.00   0 6   sysinfo
> >   0.000.00   0 7   gettid
> >   0.000.00   014   sched_getaffinity
> >   0.000.00   0 1   epoll_create
> >   0.000.00   0 7   set_robust_list
> > -- --- --- - - 
> > 100.00   17.057362109518 53680 total
> >
> > On Jul 12, 2012, at 18:09 PM, Asaf Mesika wrote:
> >
> > > Hi,
> > >
> > > I have a cluster of 3 DN/RS and another computer hosting NN/Master.
> > >
> > > For some reason, two of the DataNode nodes are showing high load
> > average (~17).
> > > When using "top" I can see that the HDFS and HBASE processes are the ones
> > using most of the cpu (95% in top).
> > >
> > > When inspecting both HDFS and HBASE through JVisualVM on the
> problematic
> > nodes, I can clearly see that the cpu usage is high.
> > >
> > > Any ideas why it's happening on those two nodes (and why the 3rd is
> > resting happily)?
> > >
> > > All three computers have roughly the same hardware.
> > > The Cluster (both HBASE and HDFS) are not used currently (during my
> > inspection).
> > >
> > > Both HDFS and HBASE logs don't show any particular activity.
> > >
> > >
> > > Any leads on where I should look for more would be appreciated.
> > >
> > >
> > > Thanks!
> > >
> > > Asaf
> > >
> >
> >
>


Re: HDFS + HBASE process high cpu usage

2012-07-12 Thread Esteban Gutierrez
Hi Asaf,

By any chance has this issue been going on in your boxes for the last
few days? I won't be surprised by so many calls to futex by the JVM itself,
but since you are describing the same symptoms as the leap second issue it
would be good to know what OS you are using, if NTP is/was running or not,
and if the boxes have been restarted or not after Jul 1. If the leap second
issue is the cause of this, then just running date -s "`date`" as root will
lower the cpu usage.

regards,
esteban.


--
Cloudera, Inc.




On Thu, Jul 12, 2012 at 10:12 AM, Asaf Mesika  wrote:

> Just adding more information.
> The following is a histogram output of 'strace -p  -f -C' which
> ran for 10 seconds. For some reason futex takes 65% of the time.
>
> % time     seconds  usecs/call     calls    errors syscall
> -- --- --- - - 
>  65.06   11.097387 103108084 53662 futex
>  12.002.047692  17064112 3 restart_syscall
>   8.731.488824   2326364   accept
>   6.991.1921925624   212   poll
>   6.601.125829   2251750   epoll_wait
>   0.260.045039 50689   close
>   0.190.031703 170   187   sendto
>   0.040.007508 11068   setsockopt
>   0.030.005558  27   209   recvfrom
>   0.020.003000 375 8   sched_yield
>   0.020.002999 10728 1 epoll_ctl
>   0.010.002000 12516   open
>   0.010.001999 16712   getsockname
>   0.010.001156  3632   write
>   0.010.001000 10010   fstat
>   0.010.001000  3033   fcntl
>   0.010.000999  1567   dup2
>   0.000.000488  98 5   rt_sigreturn
>   0.000.000350   84610 read
>   0.000.000222   451   mprotect
>   0.000.000167  42 4   openat
>   0.000.92   252   stat
>   0.000.84   245   statfs
>   0.000.74   421   mmap
>   0.000.00   0 9   munmap
>   0.000.00   026   rt_sigprocmask
>   0.000.00   0 3   ioctl
>   0.000.00   0 1   pipe
>   0.000.00   0 5   madvise
>   0.000.00   0 6   socket
>   0.000.00   0 6 4 connect
>   0.000.00   0 1   shutdown
>   0.000.00   0 3   getsockopt
>   0.000.00   0 7   clone
>   0.000.00   0 8   getdents
>   0.000.00   0 3   getrlimit
>   0.000.00   0 6   sysinfo
>   0.000.00   0 7   gettid
>   0.000.00   014   sched_getaffinity
>   0.000.00   0 1   epoll_create
>   0.000.00   0 7   set_robust_list
> -- --- --- - - 
> 100.00   17.057362109518 53680 total
>
> On Jul 12, 2012, at 18:09 PM, Asaf Mesika wrote:
>
> > Hi,
> >
> > I have a cluster of 3 DN/RS and another computer hosting NN/Master.
> >
> > For some reason, two of the DataNode nodes are showing high load
> average (~17).
> > When using "top" I can see that the HDFS and HBASE processes are the ones
> using most of the cpu (95% in top).
> >
> > When inspecting both HDFS and HBASE through JVisualVM on the problematic
> nodes, I can clearly see that the cpu usage is high.
> >
> > Any ideas why it's happening on those two nodes (and why the 3rd is
> resting happily)?
> >
> > All three computers have roughly the same hardware.
> > The Cluster (both HBASE and HDFS) are not used currently (during my
> inspection).
> >
> > Both HDFS and HBASE logs don't show any particular activity.
> >
> >
> > Any leads on where I should look for more would be appreciated.
> >
> >
> > Thanks!
> >
> > Asaf
> >
>
>


Re: DataNode Hardware

2012-07-12 Thread Michael Segel
Uhm... I'd take a step back... 
> Thanks for the reply. I didn't realize that all the non-MR tasks were this 
> CPU bound; plus my naive assumption was that four spindles would have a hard 
> time supplying data to MR fast enough for it to become bogged down.


Your gut feel is correct. 

If you go with 12 cores in a 1U box and 4 drives, you will be disk I/O bound and 
you will end up watching wait CPU cycles increase. 
On a 1U box, 8 cores would be a bit better balance. Maybe go with 2.5" drives and 
more spindles. 

If you don't run HBase, 4GB per core is OK just for map/reduce. You will want 
more memory for HBase. 
With 8 cores, 32GB is OK for M/R; for HBase, 48GB is better.


On Jul 12, 2012, at 4:00 PM, Bartosz M. Frak wrote:

> Amandeep Khurana wrote:
>> The issue with having lower cores per box is that you are collocating 
>> datanode, region servers, task trackers and then the MR tasks themselves 
>> too. Plus you need a core for the OS too. These are things that need to run 
>> on a single node, so you need a minimum amount of resources that can handle 
>> all of this well. I don't see how you will be able to do compute heavy stuff 
>> in 4 cores even if you give 1 to the OS, 1 to datanodes and task tracker 
>> processes and 1 to the region server. You are left with only 1 core for the 
>> actual tasks to run. 
>> Also, if you really want low latency access to data in a reliable manner, I 
>> would separate out the MR framework onto an independent cluster and put 
>> HBase on an independent cluster. The MR framework will talk to the HBase 
>> cluster for look ups though. You'll still benefit from the caching etc but 
>> HBase will be able to guarantee performance better.
>> 
>> -Amandeep 
>> 
>>  
> Thanks for the reply. I didn't realize that all the non-MR tasks were this 
> CPU bound; plus my naive assumption was that four spindles would have a hard 
> time supplying data to MR fast enough for it to become bogged down.
> 
>> On Thursday, July 12, 2012 at 1:20 PM, Bartosz M. Frak wrote:
>> 
>>  
>>> Amandeep Khurana wrote:
>>>
 Inline. 
 
 On Thursday, July 12, 2012 at 12:56 PM, Bartosz M. Frak wrote:
 
  
> Quick question about data node hardware. I've read a few articles, which 
> cover the basics, including the Cloudera's recommendations here:
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
> 
> The article is from early 2010, but I'm assuming that the general 
> guidelines haven't deviated much from the recommended baselines. I'm 
> skewing my build towards the "Compute optimized" side of the spectrum, 
> which calls for a 1:1 core-to-spindle ratio and more RAM per node 
> for in-memory caching.
> 
> 
>
 Why are you skewing more towards compute optimized. Are you expecting to 
 run compute intensive MR interacting with HBase tables? 
 
  
>>> Correct. We'll be storing dense raw numerical time-based data, which will need 
>>> to be transformed (decimated, FFTed, correlated, etc) with relatively low 
>>> latency (under 10 seconds). We also expect repeatable reads, where the same 
>>> piece of data is "looked" at more than once in a short amount of time. This 
>>> is where we are hoping that in-memory caching and data node affinity can 
>>> help us.
>>>
> Other important consideration is low(ish) power consumption. With that in 
> mind I had specced out the following (per node):
> 
> Chassis: 1U Supermicro chassis with 2x 1Gb/sec ethernet ports 
> (http://www.supermicro.com/products/system/1u/5017/sys-5017c-mtf.cfm) 
> (~500USD)
> Memory: 32GB Unbuffered ECC RAM (~280USD)
> Disks: 4x2TBHitachi Ultrastar 7200RPM SAS Drives (~960USD)
> 
> 
>
 You can use plain SATA. Don't need SAS. 
 
  
>>> This is a government-sponsored project, so some requirements (like MTBF and 
>>> spindle warranty) are "set in stone", but I'll look into that.
>>>
> CPU: 1x Intel E3-1230-v2 (3.3Ghz 4 Core / 8 Thread 69W) (~240USD)
> 
> 
>
 Consider getting dual hex core CPUs.
 
 
  
>>> I'm trying to avoid that for two reasons. Dual socket boards are (1) more 
>>> expensive and (2) power hungry. Additionally the CPUs for those boards are 
>>> also more expensive and less efficient than the one socket counterparts 
>>> (take a look at Intel's E3 and E5 line pricing). The guidelines from the 
>>> quoted article state:
>>> 
>>> "Compute Intensive Configuration (2U/machine): Two quad core CPUs, 48-72GB 
>>> memory, and 8 disk drives (1TB or 2TB). These are often used when a 
>>> combination of large in-memory models and heavy reference data caching is 
>>> required."
>>> 
>>> My two 1U machines, which are equivalent to this recommendation, have 8 (very 
>>> fast, low wattage) cores, 64GB RAM and 8 2TB disks.
>>> 
>>>
> The backplane will consist of a dedi

Re: hbase secure channel

2012-07-12 Thread Andrew Purtell
On Thu, Jul 12, 2012 at 2:20 PM, Tony Dean  wrote:
> Hi,
>
> Once authentication has been accomplished the application data begins to flow 
> between client and server.  How can one assure that the data is private?
>
> I see an hbase property to turn on privacy: hbase.rpc.protection=privacy.

This tells SASL on the server side to require successful 'auth-conf'
negotiation instead of just 'auth'. The result is a connection wrapped
by encryption with a shared key or no connection if the negotiation
fails. SASL delegates keying set up to the security layer
implementation. For Hadoop/HBase that would be Kerberos.
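
For reference, a hedged sketch of what turning this on looks like from a client's Configuration (the hbase.rpc.protection property and the auth-conf behavior are from the message above; the other Kerberos-related setting is the usual one for a secure setup and may differ in your deployment, and the server side needs the matching value in hbase-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.security.authentication", "kerberos");  // use the secure RPC engine
conf.set("hbase.rpc.protection", "privacy");            // SASL QOP auth-conf: encrypt the payload
HTable table = new HTable(conf, "mytable");             // connections now negotiate auth-conf or fail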

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)


hbase secure channel

2012-07-12 Thread Tony Dean
Hi,

Once authentication has been accomplished the application data begins to flow 
between client and server.  How can one assure that the data is private?

I see an hbase property to turn on privacy: hbase.rpc.protection=privacy.  Is
this basically SSL, but instead of using certificates, it's using the Kerberos
shared key that was deposited at the service when the client sends the service
ticket?

Thanks.

-Tony






Re: hbase multi-user security

2012-07-12 Thread Devaraj Das
In secure mode, the server will expect to see rpc-user == authenticating-user. So 
(without code digging, IIRC) the idea of using an arbitrary rpc-user might not 
work. The proxy user stuff (my earlier mail) attempts to address this problem. 
Please correct me if I am missing/overlooking something, Andrew.

On Jul 12, 2012, at 1:49 PM, Tony Dean wrote:

> Gotcha. Why not create a UserContext thread-local class in which consumers 
> can set a specific UGI that they create, so the secure RPC client HBase 
> code can use it if it's there, and otherwise fall back to the static UGI loginUser?
> 
> consumers can choose to take the thread-local hit or not.
> 
> -Tony
> 
> -Original Message-
> From: Andrew Purtell [mailto:apurt...@apache.org] 
> Sent: Thursday, July 12, 2012 4:09 PM
> To: user@hbase.apache.org
> Subject: Re: hbase multi-user security
> 
> On Thu, Jul 12, 2012 at 12:44 PM, Tony Dean  wrote:
> 
>> I'm wondering how that proxy user can be injected into the RPC connection 
>> when making requests.
> 
> Right, hence the suggestion to be able to set User per thread, at least, via 
> a thread local, so you can set at will and RPC will pick it up.
> 
> Best regards,
> 
>   - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
> Tom White)
> 
> 



Implement a shell in the Master UI

2012-07-12 Thread Claudiu Olteanu
Hello!

My name is Claudiu Olteanu and I want to implement a shell in the Master UI. 
The problem is that I don't know how to capture the output of IRB commands. I 
tried to create a new Ruby class which runs the commands and saves stdout, but 
it can't call any of IRB's methods.

I've never used Ruby before, so please give me some tips!

You can find a sample of my code here [1].

Best regards,
Claudiu


[1] - 
http://stackoverflow.com/questions/11457960/rubynomethoderror-call-a-ruby-method-from-java
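
One approach that may help (a hedged sketch using the JRuby embedding API that the HBase shell runs on; it is not taken from the Master UI code): point the scripting container's writer at a StringWriter before evaluating the command, then read back whatever the script printed.

import java.io.StringWriter;
import org.jruby.embed.ScriptingContainer;

public class ShellOutputCapture {
  public static void main(String[] args) {
    StringWriter out = new StringWriter();
    ScriptingContainer container = new ScriptingContainer();
    container.setWriter(out);        // capture anything the embedded Ruby prints
    container.setErrorWriter(out);   // capture warnings/errors too

    // In the real UI this string would be the command typed by the user;
    // loading the HBase shell's ruby scripts first would be a further step.
    container.runScriptlet("puts 1 + 1");

    System.out.println("captured: " + out.toString());
  }
}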


Re: hbase multi-user security

2012-07-12 Thread Devaraj Das
Wouldn't this work:

User user = User.create(
    UserGroupInformation.createProxyUser(userToImpersonate,
        UserGroupInformation.getLoginUser()));

// Run the region server operation within a runAs (authentication will happen
// using the credentials of the login user)
user.runAs(...)

At the RPC layer, the connections are keyed by an object that has a User instance 
too, and so things should work. The User class doesn't have a createProxyUser 
API - hence the call to UserGroupInformation.createProxyUser.
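
A slightly fuller sketch of how that might be wired up in client code (hedged: the user and table names are placeholders, and whether the secure RPC layer honors the proxy user end to end is exactly what this thread is questioning; the server must also be configured to let the login user impersonate):

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.security.User;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserGet {
  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();

    // Impersonate "alice" while authenticating as the kinit'ed login user.
    UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
        "alice", UserGroupInformation.getLoginUser());
    User user = User.create(proxyUgi);

    Result row = user.runAs(new PrivilegedExceptionAction<Result>() {
      public Result run() throws Exception {
        HTable table = new HTable(conf, "mytable");  // connection set up as the proxy user
        try {
          return table.get(new Get(Bytes.toBytes("row1")));
        } finally {
          table.close();
        }
      }
    });
    System.out.println(row);
  }
}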

On Jul 12, 2012, at 1:09 PM, Andrew Purtell wrote:

> On Thu, Jul 12, 2012 at 12:44 PM, Tony Dean  wrote:
> 
>> I'm wondering how that proxy user can be injected into the RPC connection 
>> when making requests.
> 
> Right, hence the suggestion to be able to set User per thread, at
> least, via a thread local, so you can set at will and RPC will pick it
> up.
> 
> Best regards,
> 
>   - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)



Re: DataNode Hardware

2012-07-12 Thread Bartosz M. Frak

Amandeep Khurana wrote:
The issue with having lower cores per box is that you are collocating datanode, region servers, task trackers and then the MR tasks themselves too. Plus you need a core for the OS too. These are things that need to run on a single node, so you need a minimum amount of resources that can handle all of this well. I don't see how you will be able to do compute heavy stuff in 4 cores even if you give 1 to the OS, 1 to datanodes and task tracker processes and 1 to the region server. You are left with only 1 core for the actual tasks to run. 


Also, if you really want low latency access to data in a reliable manner, I 
would separate out the MR framework onto an independent cluster and put HBase 
on an independent cluster. The MR framework will talk to the HBase cluster for 
look ups though. You'll still benefit from the caching etc but HBase will be 
able to guarantee performance better.

-Amandeep 



  
Thanks for the reply. I didn't realize that all the non-MR tasks were 
this CPU bound; plus my naive assumption was that four spindles would 
have a hard time supplying data to MR fast enough for it to become 
bogged down.



On Thursday, July 12, 2012 at 1:20 PM, Bartosz M. Frak wrote:

  

Amandeep Khurana wrote:

Inline. 



On Thursday, July 12, 2012 at 12:56 PM, Bartosz M. Frak wrote:

  
Quick question about data node hardware. I've read a few articles, which 
cover the basics, including the Cloudera's recommendations here:

http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

The article is from early 2010, but I'm assuming that the general 
guidelines haven't deviated much from the recommended baselines. I'm 
skewing my build towards the "Compute optimized" side of the spectrum, 
which calls for a 1:1 core-to-spindle ratio and more RAM per node 
for in-memory caching.




Why are you skewing more towards compute optimized. Are you expecting to run compute intensive MR interacting with HBase tables? 



  
Correct. We'll be storing dense raw numerical time-based data, which will 
need to be transformed (decimated, FFTed, correlated, etc) with 
relatively low latency (under 10 seconds). We also expect repeatable 
reads, where the same piece of data is "looked" at more than once in a 
short amount of time. This is where we are hoping that in-memory caching 
and data node affinity can help us.

Other important consideration is low(ish) power 
consumption. With that in mind I had specced out the following (per node):


Chassis: 1U Supermicro chassis with 2x 1Gb/sec ethernet ports 
(http://www.supermicro.com/products/system/1u/5017/sys-5017c-mtf.cfm) 
(~500USD)

Memory: 32GB Unbuffered ECC RAM (~280USD)
Disks: 4x2TBHitachi Ultrastar 7200RPM SAS Drives (~960USD)



You can use plain SATA. Don't need SAS. 



  
This is a government-sponsored project, so some requirements (like MTBF 
and spindle warranty) are "set in stone", but I'll look into that.


CPU: 1x Intel E3-1230-v2 (3.3Ghz 4 Core / 8 Thread 69W) (~240USD)




Consider getting dual hex core CPUs.


  
I'm trying to avoid that for two reasons. Dual socket boards are (1) 
more expensive and (2) power hungry. Additionally the CPUs for those 
boards are also more expensive and less efficient than the one socket 
counterparts (take a look at Intel's E3 and E5 line pricing). The 
guidelines from the quoted article state:


"Compute Intensive Configuration (2U/machine): Two quad core CPUs, 
48-72GB memory, and 8 disk drives (1TB or 2TB). These are often used 
when a combination of large in-memory models and heavy reference data 
caching is required."


My two 1U machines, which are equivalent to this recommendation, have 8 
(very fast, low wattage) cores, 64GB RAM and 8 2TB disks.



The backplane will consist of a dedicated high powered switch (not sure 
which one yet) with each node utilizing link aggregation.


Does this look reasonable? We are looking into buying 4-5 of those for 
our initial test bench for under $1 and plan to expand to about 
50-100 nodes by next year.






  



  




RE: hbase multi-user security

2012-07-12 Thread Tony Dean
Gotcha. Why not create a UserContext thread-local class in which consumers can 
set a specific UGI that they create, so the secure RPC client HBase code 
can use it if it's there, and otherwise fall back to the static UGI loginUser?

consumers can choose to take the thread-local hit or not.
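
A minimal sketch of what that could look like (hedged: UserContext is the hypothetical class named above, not an HBase API; the fallback mirrors the static loginUser behavior being discussed):

import java.io.IOException;
import org.apache.hadoop.hbase.security.User;

// Hypothetical helper sketched from the suggestion above; not part of HBase.
public final class UserContext {
  private static final ThreadLocal<User> CURRENT = new ThreadLocal<User>();

  public static void set(User user) { CURRENT.set(user); }

  public static void clear() { CURRENT.remove(); }

  // What a thread-local-aware User.getCurrent() could consult:
  public static User get() throws IOException {
    User u = CURRENT.get();
    return (u != null) ? u : User.getCurrent();  // fall back to the static login user
  }
}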

-Tony

-Original Message-
From: Andrew Purtell [mailto:apurt...@apache.org] 
Sent: Thursday, July 12, 2012 4:09 PM
To: user@hbase.apache.org
Subject: Re: hbase multi-user security

On Thu, Jul 12, 2012 at 12:44 PM, Tony Dean  wrote:

> I'm wondering how that proxy user can be injected into the RPC connection 
> when making requests.

Right, hence the suggestion to be able to set User per thread, at least, via a 
thread local, so you can set at will and RPC will pick it up.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)




Re: DataNode Hardware

2012-07-12 Thread Amandeep Khurana
The issue with having lower cores per box is that you are collocating datanode, 
region servers, task trackers and then the MR tasks themselves too. Plus you 
need a core for the OS too. These are things that need to run on a single node, 
so you need a minimum amount of resources that can handle all of this well. I 
don't see how you will be able to do compute-heavy stuff on 4 cores even if you 
give 1 to the OS, 1 to the datanode and task tracker processes, and 1 to the region 
server. You are left with only 1 core for the actual tasks to run. 

Also, if you really want low latency access to data in a reliable manner, I 
would separate out the MR framework onto an independent cluster and put HBase 
on an independent cluster. The MR framework will talk to the HBase cluster for 
look ups though. You'll still benefit from the caching etc but HBase will be 
able to guarantee performance better.

-Amandeep 


On Thursday, July 12, 2012 at 1:20 PM, Bartosz M. Frak wrote:

> Amandeep Khurana wrote:
> > Inline. 
> > 
> > 
> > On Thursday, July 12, 2012 at 12:56 PM, Bartosz M. Frak wrote:
> > 
> > > Quick question about data node hardware. I've read a few articles, which 
> > > cover the basics, including the Cloudera's recommendations here:
> > > http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
> > > 
> > > The article is from early 2010, but I'm assuming that the general 
> > > guidelines haven't deviated much from the recommended baselines. I'm 
> > > skewing my build towards the "Compute optimized" side of the spectrum, 
> > > which calls for a 1:1 core-to-spindle ratio and more RAM per node 
> > > for in-memory caching.
> > > 
> > > 
> > 
> > Why are you skewing more towards compute optimized. Are you expecting to 
> > run compute intensive MR interacting with HBase tables? 
> > 
> > 
> 
> Correct. We'll be storing dense raw numerical time-based data, which will 
> need to be transformed (decimated, FFTed, correlated, etc) with 
> relatively low latency (under 10 seconds). We also expect repeatable 
> reads, where the same piece of data is "looked" at more than once in a 
> short amount of time. This is where we are hoping that in-memory caching 
> and data node affinity can help us.
> > > Other important consideration is low(ish) power 
> > > consumption. With that in mind I had specced out the following (per node):
> > > 
> > > Chassis: 1U Supermicro chassis with 2x 1Gb/sec ethernet ports 
> > > (http://www.supermicro.com/products/system/1u/5017/sys-5017c-mtf.cfm) 
> > > (~500USD)
> > > Memory: 32GB Unbuffered ECC RAM (~280USD)
> > > Disks: 4x2TBHitachi Ultrastar 7200RPM SAS Drives (~960USD)
> > > 
> > > 
> > 
> > You can use plain SATA. Don't need SAS. 
> > 
> > 
> 
> This is a government-sponsored project, so some requirements (like MTBF 
> and spindle warranty) are "set in stone", but I'll look into that.
> > > CPU: 1x Intel E3-1230-v2 (3.3Ghz 4 Core / 8 Thread 69W) (~240USD)
> > > 
> > > 
> > 
> > Consider getting dual hex core CPUs.
> > 
> > 
> 
> I'm trying to avoid that for two reasons. Dual socket boards are (1) 
> more expensive and (2) power hungry. Additionally the CPUs for those 
> boards are also more expensive and less efficient than the one socket 
> counterparts (take a look at Intel's E3 and E5 line pricing). The 
> guidelines from the quoted article state:
> 
> "Compute Intensive Configuration (2U/machine): Two quad core CPUs, 
> 48-72GB memory, and 8 disk drives (1TB or 2TB). These are often used 
> when a combination of large in-memory models and heavy reference data 
> caching is required."
> 
> My two 1U machines, which are equivalent to this recommendation, have 8 
> (very fast, low wattage) cores, 64GB RAM and 8 2TB disks.
> 
> > > The backplane will consist of a dedicated high powered switch (not sure 
> > > which one yet) with each node utilizing link aggregation.
> > > 
> > > Does this look reasonable? We are looking into buying 4-5 of those for 
> > > our initial test bench for under $1 and plan to expand to about 
> > > 50-100 nodes by next year.
> > > 
> > > 
> > 
> > 
> > 



Re: DataNode Hardware

2012-07-12 Thread Bartosz M. Frak

Amandeep Khurana wrote:
Inline. 



On Thursday, July 12, 2012 at 12:56 PM, Bartosz M. Frak wrote:

  
Quick question about data node hardware. I've read a few articles, which 
cover the basics, including the Cloudera's recommendations here:

http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

The article is from early 2010, but I'm assuming that the general 
guidelines haven't deviated much from the recommended baselines. I'm 
skewing my build towards the "Compute optimized" side of the spectrum, 
which calls for a 1:1 core-to-spindle ratio and more RAM per node 
for in-memory caching.






Why are you skewing more towards compute optimized. Are you expecting to run compute intensive MR interacting with HBase tables? 
  
Correct. We'll be storing dense raw numerical time-based data, which will 
need to be transformed (decimated, FFTed, correlated, etc) with 
relatively low latency (under 10 seconds). We also expect repeatable 
reads, where the same piece of data is "looked" at more than once in a 
short amount of time. This is where we are hoping that in-memory caching 
and data node affinity can help us.
Other important consideration is low(ish) power 
consumption. With that in mind I had specced out the following (per node):


Chassis: 1U Supermicro chassis with 2x 1Gb/sec ethernet ports 
(http://www.supermicro.com/products/system/1u/5017/sys-5017c-mtf.cfm) 
(~500USD)

Memory: 32GB Unbuffered ECC RAM (~280USD)
Disks: 4x2TBHitachi Ultrastar 7200RPM SAS Drives (~960USD)





You can use plain SATA. Don't need SAS. 
  
This is a government-sponsored project, so some requirements (like MTBF 
and spindle warranty) are "set in stone", but I'll look into that.

CPU: 1x Intel E3-1230-v2 (3.3Ghz 4 Core / 8 Thread 69W) (~240USD)





Consider getting dual hex core CPUs.
  
I'm trying to avoid that for two reasons. Dual socket boards are (1) 
more expensive and (2) power hungry. Additionally the CPUs for those 
boards are also more expensive and less efficient than the one socket 
counterparts (take a look at Intel's E3 and E5 line pricing). The 
guidelines from the quoted article state:


"Compute Intensive Configuration (2U/machine): Two quad core CPUs, 
48-72GB memory, and 8 disk drives (1TB or 2TB). These are often used 
when a combination of large in-memory models and heavy reference data 
caching is required."


My two 1U machines, which are equivalent to this recommendation, have 8 
(very fast, low wattage) cores, 64GB RAM and 8 2TB disks.


The backplane will consist of a dedicated high powered switch (not sure 
which one yet) with each node utilizing link aggregation.


Does this look reasonable? We are looking into buying 4-5 of those for 
our initial test bench for under $1 and plan to expand to about 
50-100 nodes by next year.








  




Re: hbase multi-user security

2012-07-12 Thread Andrew Purtell
On Thu, Jul 12, 2012 at 12:44 PM, Tony Dean  wrote:

> I'm wondering how that proxy user can be injected into the RPC connection 
> when making requests.

Right, hence the suggestion to be able to set User per thread, at
least, via a thread local, so you can set at will and RPC will pick it
up.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)


RE: hbase multi-user security

2012-07-12 Thread Tony Dean
Devaraj,

I do see hbase secure impersonation being a nice feature for a multi-user 
environment.  You authenticate one user and perform actions based on other 
identities.

-Tony

-Original Message-
From: Tony Dean 
Sent: Thursday, July 12, 2012 3:45 PM
To: user@hbase.apache.org
Subject: RE: hbase multi-user security

Thanks Andy for the reply.

I understand your normal use case...

If we are hosting we could create separate Web apps per client so that 
authentication occurs for each client back to the same hbase/hadoop cluster... 
therefore, each client would see only the data that they are supposed to see.

In looking at UGI, I see createProxyUser(...).  Could that be useful?  It 
returns a UGI object.  But, I'm wondering how that proxy user can be injected 
into the RPC connection when making requests.

Thanks again.

-Original Message-
From: Andrew Purtell [mailto:apurt...@apache.org] 
Sent: Wednesday, July 11, 2012 3:11 PM
To: user@hbase.apache.org
Subject: Re: hbase multi-user security

On Wed, Jul 11, 2012 at 11:51 AM, Tony Dean  wrote:
> Yes, I saw that.  But once you have a User, how do you get the SecureClient 
> connection to use it?  It seems to just call User.getCurrent().  And it's 
> static, so there can only be one.

I think Hadoop's UserGroupInformation is the same, static.

We didn't consider a use case where a client application would have more than 
one credential. For how Hadoop security was used up to that point, that wasn't 
common (or done at all?).

A pretty easy change would be to make User.getCurrent() look up a thread local 
variable. Then we could change the principal on a per thread basis in a 
multithreaded/multiuser application. Use of the thread local has a 
unconditional performance cost though.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)




Re: DataNode Hardware

2012-07-12 Thread Amandeep Khurana
Inline. 


On Thursday, July 12, 2012 at 12:56 PM, Bartosz M. Frak wrote:

> Quick question about data node hardware. I've read a few articles, which 
> cover the basics, including the Cloudera's recommendations here:
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
> 
> The article is from early 2010, but I'm assuming that the general 
> guidelines haven't deviated much from the recommended baselines. I'm 
> skewing my build towards the "Compute optimized" side of the spectrum, 
> which calls for a 1:1 core-to-spindle ratio and more RAM per node 
> for in-memory caching.
> 
> 

Why are you skewing more towards compute optimized. Are you expecting to run 
compute intensive MR interacting with HBase tables? 
> Other important consideration is low(ish) power 
> consumption. With that in mind I had specced out the following (per node):
> 
> Chassis: 1U Supermicro chassis with 2x 1Gb/sec ethernet ports 
> (http://www.supermicro.com/products/system/1u/5017/sys-5017c-mtf.cfm) 
> (~500USD)
> Memory: 32GB Unbuffered ECC RAM (~280USD)
> Disks: 4x2TBHitachi Ultrastar 7200RPM SAS Drives (~960USD)
> 
> 

You can use plain SATA. Don't need SAS. 
> CPU: 1x Intel E3-1230-v2 (3.3Ghz 4 Core / 8 Thread 69W) (~240USD)
> 
> 

Consider getting dual hex core CPUs.
> 
> The backplane will consist of a dedicated high powered switch (not sure 
> which one yet) with each node utilizing link aggregation.
> 
> Does this look reasonable? We are looking into buying 4-5 of those for 
> our initial test bench for under $1 and plan to expand to about 
> 50-100 nodes by next year.
> 
> 




DataNode Hardware

2012-07-12 Thread Bartosz M. Frak
Quick question about data node hardware. I've read a few articles, which 
cover the basics, including Cloudera's recommendations here:

http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

The article is from early 2010, but I'm assuming that the general 
guidelines haven't deviated much from the recommended baselines. I'm 
skewing my build towards the "Compute optimized" side of the spectrum, 
which calls for a 1:1 core-to-spindle ratio and more RAM per node 
for in-memory caching. Another important consideration is low(ish) power 
consumption. With that in mind I had specced out the following (per node):


Chassis: 1U Supermicro chassis with 2x 1Gb/sec ethernet ports 
(http://www.supermicro.com/products/system/1u/5017/sys-5017c-mtf.cfm) 
(~500USD)

Memory: 32GB Unbuffered ECC RAM (~280USD)
Disks: 4x 2TB Hitachi Ultrastar 7200RPM SAS drives (~960USD)
CPU: 1x Intel E3-1230-v2 (3.3Ghz 4 Core / 8 Thread 69W) (~240USD)

The backplane will consist of a dedicated high powered switch (not sure 
which one yet) with each node utilizing link aggregation.


Does this look reasonable? We are looking into buying 4-5 of those for 
our initial test bench for under $1 and plan to expand to about 
50-100 nodes by next year.






RE: hbase multi-user security

2012-07-12 Thread Tony Dean
Thanks Andy for the reply.

I understand your normal use case...

If we are hosting we could create separate Web apps per client so that 
authentication occurs for each client back to the same hbase/hadoop cluster... 
therefore, each client would see only the data that they are supposed to see.

In looking at UGI, I see createProxyUser(...).  Could that be useful?  It 
returns a UGI object.  But, I'm wondering how that proxy user can be injected 
into the RPC connection when making requests.

Thanks again.

-Original Message-
From: Andrew Purtell [mailto:apurt...@apache.org] 
Sent: Wednesday, July 11, 2012 3:11 PM
To: user@hbase.apache.org
Subject: Re: hbase multi-user security

On Wed, Jul 11, 2012 at 11:51 AM, Tony Dean  wrote:
> Yes, I saw that.  But once you have a User, how do you get the SecureClient 
> connection to use it?  It seems to just call User.getCurrent().  And it's 
> static, so there can only be one.

I think Hadoop's UserGroupInformation is the same, static.

We didn't consider a use case where a client application would have more than 
one credential. For how Hadoop security was used up to that point, that wasn't 
common (or done at all?).

A pretty easy change would be to make User.getCurrent() look up a thread local 
variable. Then we could change the principal on a per thread basis in a 
multithreaded/multiuser application. Use of the thread local has a 
unconditional performance cost though.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)




Re: hbase multi-user security

2012-07-12 Thread Devaraj Das

On Jul 11, 2012, at 10:41 AM, Tony Dean wrote:

> Hi,
> 
> Looking into hbase security, it appears that when HBaseRPC is creating a 
> proxy (e.g., SecureRpcEngine), it injects the current user:
> User.getCurrent() which by default is the cached Kerberos TGT (kinit'ed user 
> - using the "hadoop-user-kerberos" JAAS context).
> 
> Since the server proxy always uses User.getCurrent(), how can an application 
> inject the user it wants to use for authorization checks on the peer (region 
> server)?
> 
> And since SecureHadoopUser is a static class, how can you have more than 1 
> active user in the same application?
> 
> What you have works for a single user application like the hbase shell, but 
> what about a multi-user application?
> 

Over in Hadoop, in order to support use cases like Oozie where it would need to 
talk to NameNode and JobTracker on behalf of other users, the concept of secure 
impersonation was introduced. Have a look at 
http://hadoop.apache.org/common/docs/r1.0.3/Secure_Impersonation.html. This can 
be mapped to HBase land. Do you think this would address your need, Tony?

> Am I missing something?
> 
> Thanks!
> Tony Dean
> SAS Institute Inc.
> Senior Software Developer
> 919-531-6704
> 
> 
> 
> 
> 



Re: HDFS + HBASE process high cpu usage

2012-07-12 Thread Asaf Mesika
Just adding more information.
The following is the histogram output of 'strace -p  -f -C', which ran 
for 10 seconds. For some reason futex takes 65% of the time. 

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 65.06   11.097387         103    108084     53662 futex
 12.00    2.047692      170641        12         3 restart_syscall
  8.73    1.488824       23263        64           accept
  6.99    1.192192        5624       212           poll
  6.60    1.125829       22517        50           epoll_wait
  0.26    0.045039         506        89           close
  0.19    0.031703         170       187           sendto
  0.04    0.007508         110        68           setsockopt
  0.03    0.005558          27       209           recvfrom
  0.02    0.003000         375         8           sched_yield
  0.02    0.002999         107        28         1 epoll_ctl
  0.01    0.002000         125        16           open
  0.01    0.001999         167        12           getsockname
  0.01    0.001156          36        32           write
  0.01    0.001000         100        10           fstat
  0.01    0.001000          30        33           fcntl
  0.01    0.000999          15        67           dup2
  0.00    0.000488          98         5           rt_sigreturn
  0.00    0.000350           8        46        10 read
  0.00    0.000222           4        51           mprotect
  0.00    0.000167          42         4           openat
  0.00    0.000092           2        52           stat
  0.00    0.000084           2        45           statfs
  0.00    0.000074           4        21           mmap
  0.00    0.000000           0         9           munmap
  0.00    0.000000           0        26           rt_sigprocmask
  0.00    0.000000           0         3           ioctl
  0.00    0.000000           0         1           pipe
  0.00    0.000000           0         5           madvise
  0.00    0.000000           0         6           socket
  0.00    0.000000           0         6         4 connect
  0.00    0.000000           0         1           shutdown
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0         7           clone
  0.00    0.000000           0         8           getdents
  0.00    0.000000           0         3           getrlimit
  0.00    0.000000           0         6           sysinfo
  0.00    0.000000           0         7           gettid
  0.00    0.000000           0        14           sched_getaffinity
  0.00    0.000000           0         1           epoll_create
  0.00    0.000000           0         7           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00   17.057362                109518     53680 total
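
As a side note, a quick way to see which JVM threads are burning the CPU, 
without strace, is the ThreadMXBean API. A minimal sketch for the current JVM 
(to inspect a running RegionServer/DataNode you would hook this up through 
JMX, or just use jstack together with top -H):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class ThreadCpuSampler {
  public static void main(String[] args) throws InterruptedException {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    if (!mx.isThreadCpuTimeSupported()) {
      System.err.println("Thread CPU time not supported on this JVM");
      return;
    }
    mx.setThreadCpuTimeEnabled(true);

    Map<Long, Long> before = snapshot(mx);
    Thread.sleep(10000L);                       // 10-second sample window
    Map<Long, Long> after = snapshot(mx);

    for (Map.Entry<Long, Long> e : after.entrySet()) {
      Long start = before.get(e.getKey());
      long deltaNanos = e.getValue() - (start == null ? 0L : start);
      ThreadInfo info = mx.getThreadInfo(e.getKey());
      if (info != null && deltaNanos > 0) {
        System.out.printf("%-40s %8.1f ms cpu%n",
            info.getThreadName(), deltaNanos / 1e6);
      }
    }
  }

  // Records cumulative CPU time (nanoseconds) per live thread id.
  private static Map<Long, Long> snapshot(ThreadMXBean mx) {
    Map<Long, Long> cpu = new HashMap<Long, Long>();
    for (long id : mx.getAllThreadIds()) {
      long t = mx.getThreadCpuTime(id);   // -1 if the thread is gone
      if (t >= 0) {
        cpu.put(id, t);
      }
    }
    return cpu;
  }
}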

On Jul 12, 2012, at 18:09 PM, Asaf Mesika wrote:

> Hi,
> 
> I have a cluster of 3 DN/RS and another computer hosting NN/Master.
> 
> For some reason, two of the DataNode nodes are showing a high load average 
> (~17).
> When using "top" I can see that the HDFS and HBase processes are the ones 
> using most of the CPU (95% in top).
> 
> When inspecting both HDFS and HBase through JVisualVM on the problematic 
> nodes, I can clearly see that the CPU usage is high.
> 
> Any ideas why it's happening on those two nodes (and why the third is resting 
> happily)?
> 
> All three computers have roughly the same hardware.
> The cluster (both HBase and HDFS) is not in use currently (during my 
> inspection).
> 
> Neither the HDFS nor the HBase logs show any particular activity.
> 
> 
> Any leads on where I should look for more would be appreciated.
> 
> 
> Thanks!
> 
> Asaf
> 



HDFS + HBASE process high cpu usage

2012-07-12 Thread Asaf Mesika
Hi,

I have a cluster of 3 DN/RS and another computer hosting NN/Master.

For some reason, two of the DataNode nodes are showing a high load average (~17).
When using "top" I can see that the HDFS and HBase processes are the ones using 
most of the CPU (95% in top).

When inspecting both HDFS and HBase through JVisualVM on the problematic nodes, 
I can clearly see that the CPU usage is high.

Any ideas why it's happening on those two nodes (and why the third is resting 
happily)?

All three computers have roughly the same hardware.
The cluster (both HBase and HDFS) is not in use currently (during my inspection).

Neither the HDFS nor the HBase logs show any particular activity.


Any leads on where I should look for more would be appreciated.


Thanks!

Asaf



Re: Why Hadoop can't find Reducer when Mapper reads data from HBase?

2012-07-12 Thread Stack
On Thu, Jul 12, 2012 at 1:15 PM, yonghu  wrote:
> java.lang.RuntimeException: java.lang.ClassNotFoundException:
> com.mapreducetablescan.MRTableAccess$MTableReducer;
>
> Does anybody know why?
>

It's not in your job jar?  Check the job jar (jar -tf JAR_FILE).
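
For example, a minimal job driver sketch (placeholder class and table names, 
not the original poster's code) where the reducer is an inner class; 
job.setJarByClass(...) is what makes sure a $MTableReducer-style inner class 
ends up in the job jar that jar -tf should list:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TableScanJob {

  // Placeholder mapper: emits each row key once.
  public static class MyMapper extends TableMapper<Text, Text> {
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(row.get()), new Text("1"));
    }
  }

  // Placeholder reducer: passes the first value through.
  public static class MyReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(key, values.iterator().next());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "table-scan");

    // Ties the job jar to this class, so both the mapper and the inner
    // reducer class get shipped to the cluster inside that jar.
    job.setJarByClass(TableScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);

    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, MyMapper.class, Text.class, Text.class, job);
    TableMapReduceUtil.addDependencyJars(job);  // ships HBase/ZK client jars

    job.setReducerClass(MyReducer.class);
    job.setNumReduceTasks(1);
    job.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(job, new Path("/tmp/table-scan-out"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}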

St.Ack


Re: Blocking Inserts

2012-07-12 Thread Martin Alig
Thank you for the comment.

The compaction queue seems to be at 0 (?) all the time.
About the blocking store files setting: I already increased this value, but I
could not see any improvement.

Going through the logs during a "blocking" period, I often see a
"CompactionRequest". Then, for a minute or so, nothing, and then it continues.
Similarly, in the logs I see "Finished memstore flush" and then for two
minutes nothing, and then it continues. And of course, insertions also
continue.

Is this just normal behavior? Or did I misconfigure something?
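
For reference, the two settings discussed in this thread (see Suraj's reply 
quoted below), shown as a rough sketch. They are region-server-side settings 
that belong in hbase-site.xml on every region server (followed by a restart), 
not in client code, and the values here are only examples, not 
recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BlockingTuningExample {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // Upper bound on store files per store before writes are blocked
    // while compactions catch up (default 7).
    conf.setInt("hbase.hstore.blockingStoreFiles", 20);

    // Updates are blocked once a region's memstore reaches
    // flush.size * multiplier (the "memstore size 1.0g is >= than
    // blocking 1.0g size" message in the logs quoted below).
    conf.setInt("hbase.hregion.memstore.block.multiplier", 4);
    conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);

    System.out.println("blockingStoreFiles = "
        + conf.getInt("hbase.hstore.blockingStoreFiles", 7));
  }
}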




On Wed, Jul 4, 2012 at 12:17 AM, Suraj Varma  wrote:

> In your case, likely you are hitting the blocking store files
> (hbase.hstore.blockingStoreFiles default:7) and/or
> hbase.hregion.memstore.block.multiplier - check out
> http://hbase.apache.org/book/config.files.html for more details on
> these configurations and how they affect your insert performance.
>
> On ganglia, also check whether you have a compaction queue spiking
> during these timeouts.
> --Suraj
>
>
> On Thu, Jun 21, 2012 at 4:27 AM, Martin Alig 
> wrote:
> > Thank you for the suggestions.
> >
> > So I changed the setup and now have:
> > 1 Master running Namenode, SecondaryNamenode, ZK and the HMaster
> > 7 Slaves running Datanode and Regionserver
> > 2 Clients to insert data
> >
> >
> > What I forgot in my first post, that sometimes the clients even get a
> > SocketTimeOutException when inserting the data. (of course during that
> time
> > 0 inserts are done)
> > By looking at the logs, (I also turned on the gc logs) I see the
> following:
> >
> > Multiple consecutive entries like:
> > 2012-06-21 11:42:13,962 INFO
> org.apache.hadoop.hbase.regionserver.HRegion:
> > Blocking updates for 'IPC Server handler 6 on 60020' on region
> > usertable,user600,1340200683555.a45b03dd65a62afa676488921e47dbaa.:
> memstore
> > size 1.0g is >= than blocking 1.0g size
> >
> > Shortly after those entries, many entries like:
> > 2012-06-21 12:43:53,028 WARN org.apache.hadoop.ipc.HBaseServer:
> > (responseTooSlow):
> >
> {"processingtimems":35046,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@2642a14d
> ),
> > rpc version=1, client version=29,
> methodsFingerPrint=-1508511443","client":"
> > 10.110.129.12:54624
> >
> ","starttimems":1340275397981,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
> >
> > Looking at the gc-logs, many entries like:
> > 2870.329: [GC 2870.330: [ParNew: 108450K->3401K(118016K), 0.0182570 secs]
> > 4184711K->4079843K(12569856K), 0.0183510 secs] [Times: user=0.24
> sys=0.00,
> > real=0.01 secs]
> >
> > But always arround 0.01 secs - 0.04secs.
> >
> > And also from the gc-log:
> > 2696.013: [CMS-concurrent-sweep: 8.999/10.448 secs] [Times: user=46.93
> > sys=2.24, real=10.45 secs]
> >
> > Is the 10.45 secs too long?
> > Or what exactly should I watch out for in the gc logs?
> >
> >
> > I also configured ganglia to have a look at some more metrics. Looking at
> > io_wait (which should matter concerning my question to the disks), I can
> > observe values between 10 % and 25 % on the regionserver.
> > Should that be lower?
> >
> > Btw. I'm using HBase 0.94 and Hadoop 1.0.3.
> >
> >
> > Thank you again.
> >
> >
> > Martin
> >
> >
> >
> > On Wed, Jun 20, 2012 at 7:04 PM, Dave Wang  wrote:
> >
> >> I'd also remove the DN and RS from the node running ZK, NN, etc. as you
> >> don't want heavweight processes on that node.
> >>
> >> - Dave
> >>
> >> On Wed, Jun 20, 2012 at 9:31 AM, Elliott Clark  >> >wrote:
> >>
> >> > Basically without metrics on what's going on it's tough to know for
> sure.
> >> >
> >> > I would turn on GC logging and make sure that is not playing a part,
> get
> >> > metrics on IO while this is going on, and look through the logs to see
> >> what
> >> > is happening when you notice the pause.
> >> >
> >> > On Wed, Jun 20, 2012 at 6:39 AM, Martin Alig 
> >> > wrote:
> >> >
> >> > > Hi
> >> > >
> >> > > I'm doing some evaluations with HBase. The workload I'm facing is
> >> mainly
> >> > > insert-only.
> >> > > Currently I'm inserting 1KB rows, where 100Bytes go into one column.
> >> > >
> >> > > I have the following cluster machines at disposal:
> >> > >
> >> > > Intel Xeon L5520 2.26 Ghz (Nehalem, with HT enabled)
> >> > > 24 GiB Memory
> >> > > 1 GigE
> >> > > 2x 15k RPM Sas 73 GB (RAID1)
> >> > >
> >> > > I have 10 Nodes.
> >> > > The first node runs:
> >> > >
> >> > > Namenode, SecondaryNamenode, Datanode, HMaster, Zookeeper, and a
> >> > > RegionServer
> >> > >
> >> > > The other nodes run:
> >> > >
> >> > > Datanode and RegionServer
> >> > >
> >> > >
> >> > > Now running my test client and inserting rows, the throughput goes
> up
> >> to
> >> > > 150'000 inserts/sec. But then after some time the throughput drops
> down
> >> > to
> >> > > 0 inserts/sec for quite some time, before it goes up again.
> >> > > My assumption is, that it happens when the RegionServers start to
> write
> >> > the
> >> > > data from memory to the disks. I know, that the

Re: Reporting tool for Hbase

2012-07-12 Thread xkwang bruce
hi amlan,

It may be that your Pentaho cannot connect to the cluster, so you should check
your config file carefully. Just my suggestion; I haven't used Pentaho or
related tools.

bruce,

2012/7/12 Amlan Roy 

> Hi,
>
>
>
> I am looking for a reporting tool that can use Hbase data as input. Any
> recommendation?
>
>
>
> I am using Pentaho PDI because it can use Hbase data as input. But I am
> getting a strange error. My cluster is running, I can access data from my
> client program. But Pentaho is giving the following error. Not sure if it
> is
> because of version mismatch. Did anybody else face the same issue?
>
>
>
> org.apache.hadoop.hbase.MasterNotRunningException: Retried 1 times
>
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:127)
>
> at
>
> org.apache.hadoop.hbase.client.HBaseAdmin.checkHBaseAvailable(HBaseAdmin.jav
> a:1551)
>
> at
>
> org.pentaho.hbase.mapping.MappingAdmin.checkHBaseAvailable(MappingAdmin.java
> :131)
>
> at
>
> org.pentaho.hbase.mapping.MappingEditor.populateTableCombo(MappingEditor.jav
> a:466)
>
> at
> org.pentaho.hbase.mapping.MappingEditor.access$100(MappingEditor.java:88)
>
> at
>
> org.pentaho.hbase.mapping.MappingEditor$3.focusGained(MappingEditor.java:231
> )
>
> at
> org.eclipse.swt.widgets.TypedListener.handleEvent(Unknown
> Source)
>
> at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
>
> at org.eclipse.swt.widgets.Widget.notifyListeners(Unknown
> Source)
>
> at org.eclipse.swt.custom.CCombo.handleFocus(Unknown
> Source)
>
> at org.eclipse.swt.custom.CCombo.textEvent(Unknown Source)
>
> at org.eclipse.swt.custom.CCombo$1.handleEvent(Unknown
> Source)
>
> at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
>
> at org.eclipse.swt.widgets.Control.sendFocusEvent(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Widget.wmSetFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Control.WM_SETFOCUS(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Control.windowProc(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Text.windowProc(Unknown Source)
>
> at org.eclipse.swt.widgets.Display.windowProc(Unknown
> Source)
>
> at org.eclipse.swt.internal.win32.OS.SetFocus(Native
> Method)
>
> at org.eclipse.swt.widgets.Control.forceFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Control.setFixedFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Composite.setFixedFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Composite.setFixedFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Composite.setFixedFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Control.fixFocus(Unknown Source)
>
> at org.eclipse.swt.widgets.Control.setVisible(Unknown
> Source)
>
> at org.eclipse.swt.custom.CTabFolder.setSelection(Unknown
> Source)
>
> at org.eclipse.swt.custom.CTabFolder.setSelection(Unknown
> Source)
>
> at org.eclipse.swt.custom.CTabFolder.onMouse(Unknown
> Source)
>
> at org.eclipse.swt.custom.CTabFolder$1.handleEvent(Unknown
> Source)
>
>
>
> Regards,
>
> Amlan
>
>
>
>


Reporting tool for Hbase

2012-07-12 Thread Amlan Roy
Hi,

 

I am looking for a reporting tool that can use Hbase data as input. Any
recommendation?

 

I am using Pentaho PDI because it can use Hbase data as input. But I am
getting a strange error. My cluster is running, I can access data from my
client program. But Pentaho is giving the following error. Not sure if it is
because of version mismatch. Did anybody else face the same issue? 

 

org.apache.hadoop.hbase.MasterNotRunningException: Retried 1 times

at
org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:127)

at
org.apache.hadoop.hbase.client.HBaseAdmin.checkHBaseAvailable(HBaseAdmin.jav
a:1551)

at
org.pentaho.hbase.mapping.MappingAdmin.checkHBaseAvailable(MappingAdmin.java
:131)

at
org.pentaho.hbase.mapping.MappingEditor.populateTableCombo(MappingEditor.jav
a:466)

at
org.pentaho.hbase.mapping.MappingEditor.access$100(MappingEditor.java:88)

at
org.pentaho.hbase.mapping.MappingEditor$3.focusGained(MappingEditor.java:231
)

at org.eclipse.swt.widgets.TypedListener.handleEvent(Unknown
Source)

at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown
Source)

at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)

at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)

at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)

at org.eclipse.swt.widgets.Widget.notifyListeners(Unknown
Source)

at org.eclipse.swt.custom.CCombo.handleFocus(Unknown Source)

at org.eclipse.swt.custom.CCombo.textEvent(Unknown Source)

at org.eclipse.swt.custom.CCombo$1.handleEvent(Unknown
Source)

at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown
Source)

at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)

at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)

at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)

at org.eclipse.swt.widgets.Control.sendFocusEvent(Unknown
Source)

at org.eclipse.swt.widgets.Widget.wmSetFocus(Unknown Source)

at org.eclipse.swt.widgets.Control.WM_SETFOCUS(Unknown
Source)

at org.eclipse.swt.widgets.Control.windowProc(Unknown
Source)

at org.eclipse.swt.widgets.Text.windowProc(Unknown Source)

at org.eclipse.swt.widgets.Display.windowProc(Unknown
Source)

at org.eclipse.swt.internal.win32.OS.SetFocus(Native Method)

at org.eclipse.swt.widgets.Control.forceFocus(Unknown
Source)

at org.eclipse.swt.widgets.Control.setFixedFocus(Unknown
Source)

at org.eclipse.swt.widgets.Composite.setFixedFocus(Unknown
Source)

at org.eclipse.swt.widgets.Composite.setFixedFocus(Unknown
Source)

at org.eclipse.swt.widgets.Composite.setFixedFocus(Unknown
Source)

at org.eclipse.swt.widgets.Control.fixFocus(Unknown Source)

at org.eclipse.swt.widgets.Control.setVisible(Unknown
Source)

at org.eclipse.swt.custom.CTabFolder.setSelection(Unknown
Source)

at org.eclipse.swt.custom.CTabFolder.setSelection(Unknown
Source)

at org.eclipse.swt.custom.CTabFolder.onMouse(Unknown Source)

at org.eclipse.swt.custom.CTabFolder$1.handleEvent(Unknown
Source) 

 

Regards,

Amlan

 



RE: Reporting tool for Hbase

2012-07-12 Thread Amlan Roy
Hi Sonal,

I am using hbase-0.92.0 with hadoop-1.0.0. Pentaho was using hbase-0.90.3
with hadoop-0.20.2. I replaced those jars with the jars I am using and
restarted Pentaho. The issue was not resolved. I searched for the logs but
did not find any in Pentaho.
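
In case it helps to rule out Pentaho itself, here is a minimal standalone check 
using the same call that appears in the stack trace (the ZooKeeper quorum value 
is a placeholder). Run it with exactly the jars and hbase-site.xml that Pentaho 
has on its classpath; if this fails too, the problem is connectivity or 
configuration rather than PDI.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class HBaseAvailableCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Hypothetical values; only needed if hbase-site.xml is not on the
    // classpath.
    conf.set("hbase.zookeeper.quorum", "zk1.example.com");
    conf.set("hbase.zookeeper.property.clientPort", "2181");

    // Same call Pentaho makes; throws MasterNotRunningException or
    // ZooKeeperConnectionException on failure.
    HBaseAdmin.checkHBaseAvailable(conf);
    System.out.println("HBase master is reachable");
  }
}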

I will take a look at Crux. Thanks a lot.

Regards,
Amlan

-Original Message-
From: Sonal Goyal [mailto:sonalgoy...@gmail.com] 
Sent: Thursday, July 12, 2012 12:20 PM
To: user@hbase.apache.org
Subject: Re: Reporting tool for Hbase

Hi Amlan,

Which versions are you running on? Do you see any errors in the hbase logs?

For reporting over Hbase, you can also take a look at Crux at
http://github.com/sonalgoyal/crux

Best Regards,
Sonal
[1] Crux: Reporting for HBase 
Nube Technologies 







On Sun, Jul 8, 2012 at 12:15 PM, Amlan Roy  wrote:

> Hi,
>
>
>
> I am looking for a reporting tool that can use Hbase data as input. Any
> recommendation?
>
>
>
> I am using Pentaho PDI because it can use Hbase data as input. But I am
> getting a strange error. My cluster is running, I can access data from my
> client program. But Pentaho is giving the following error. Not sure if it
> is
> because of version mismatch. Did anybody else face the same issue?
>
>
>
> org.apache.hadoop.hbase.MasterNotRunningException: Retried 1 times
>
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:127)
>
> at
>
>
org.apache.hadoop.hbase.client.HBaseAdmin.checkHBaseAvailable(HBaseAdmin.jav
> a:1551)
>
> at
>
>
org.pentaho.hbase.mapping.MappingAdmin.checkHBaseAvailable(MappingAdmin.java
> :131)
>
> at
>
>
org.pentaho.hbase.mapping.MappingEditor.populateTableCombo(MappingEditor.jav
> a:466)
>
> at
> org.pentaho.hbase.mapping.MappingEditor.access$100(MappingEditor.java:88)
>
> at
>
>
org.pentaho.hbase.mapping.MappingEditor$3.focusGained(MappingEditor.java:231
> )
>
> at
> org.eclipse.swt.widgets.TypedListener.handleEvent(Unknown
> Source)
>
> at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown
Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown
Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown
Source)
>
> at org.eclipse.swt.widgets.Widget.notifyListeners(Unknown
> Source)
>
> at org.eclipse.swt.custom.CCombo.handleFocus(Unknown
> Source)
>
> at org.eclipse.swt.custom.CCombo.textEvent(Unknown Source)
>
> at org.eclipse.swt.custom.CCombo$1.handleEvent(Unknown
> Source)
>
> at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown
Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown
Source)
>
> at org.eclipse.swt.widgets.Widget.sendEvent(Unknown
Source)
>
> at org.eclipse.swt.widgets.Control.sendFocusEvent(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Widget.wmSetFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Control.WM_SETFOCUS(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Control.windowProc(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Text.windowProc(Unknown Source)
>
> at org.eclipse.swt.widgets.Display.windowProc(Unknown
> Source)
>
> at org.eclipse.swt.internal.win32.OS.SetFocus(Native
> Method)
>
> at org.eclipse.swt.widgets.Control.forceFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Control.setFixedFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Composite.setFixedFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Composite.setFixedFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Composite.setFixedFocus(Unknown
> Source)
>
> at org.eclipse.swt.widgets.Control.fixFocus(Unknown
Source)
>
> at org.eclipse.swt.widgets.Control.setVisible(Unknown
> Source)
>
> at org.eclipse.swt.custom.CTabFolder.setSelection(Unknown
> Source)
>
> at org.eclipse.swt.custom.CTabFolder.setSelection(Unknown
> Source)
>
> at org.eclipse.swt.custom.CTabFolder.onMouse(Unknown
> Source)
>
> at org.eclipse.swt.custom.CTabFolder$1.handleEvent(Unknown
> Source)
>
>
>
> Regards,
>
> Amlan
>
>