Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?

2012-08-21 Thread anil gupta
Hi Alex,

Thanks for creating the JIRA.
On Monday, I completed testing the time-range filtering using timestamps,
and IMO the results seem satisfactory (if not great). The table has 34
million records (average row size is 1.21 KB), and in 136 seconds I get the
entire result of a query that returned 225 rows.
I am running an HBase 0.92, 8-node cluster on a VMware hypervisor. Each node
has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my set-up
hosts 2 slave instances (2 VMs running DataNode, NodeManager, and
RegionServer). I have allocated only 1200 MB for the RegionServers. I haven't
modified the block size of HDFS or HBase. Considering the below-par hardware
configuration of the cluster, does the performance sound OK for timestamp
filtering?
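
The approach under test, app-assigned timestamps on the Puts plus
scan.setTimeRange, looks roughly like the following minimal sketch. This
assumes the 0.92-era client API; the table, family, qualifier, row key and
timestamps are made-up placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "txn_table");  // hypothetical table name

    // Write: use the record's own event time as the cell timestamp
    // instead of letting the region server assign "now".
    long eventTs = 1345507200000L;  // app-assigned timestamp (ms)
    Put put = new Put(Bytes.toBytes("customer1_row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("amount"), eventTs, Bytes.toBytes("42.50"));
    table.put(put);

    // Read: restrict the scan to [startTs, stopTs); store files whose timestamp
    // range falls entirely outside the interval can be skipped from reading.
    Scan scan = new Scan();
    scan.setTimeRange(1345420800000L, 1345593600000L);
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(r);
    }
    scanner.close();
    table.close();
  }
}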

Thanks,
Anil

On Mon, Aug 20, 2012 at 1:07 PM, Alex Baranau wrote:

> Created: https://issues.apache.org/jira/browse/HBASE-6618
>
> Alex Baranau
> --
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> On Sat, Aug 18, 2012 at 5:02 PM, anil gupta  wrote:
>
> > Hi Alex,
> >
> > Apart from the query which i mentioned in last email. Till now, i have
> > implemented the following queries using filters and coprocessors:
> >
> > 1. Getting all the records for a customer.
> > 2. Perform min,max,avg,sum aggregation for a customer using
> coprocessors. I
> > am storing some of the data as BigDecimal also to do accurate floating
> > point calculations.
> > 3. Perform min,max,avg,sum aggregation for a customer within a given
> > time-range using coprocessors.
> > 4. Filter that data for a customer within a given time-range on the basis
> > of column values. The filtering on column values can be matching a string
> > value or it can be doing range based numerical comparison.
> >
> > Basically, as per our current requirement all the queries have customerid
> > and most of the queries have timerange also. We are not in prod yet. All
> of
> > this effort is part of a POC.
> >
> > 2. Can you set timestamp on Puts the same as timestamp "assigned" to your
> > record by app logic?
> > Anil: Wow! This sounds like an awesome idea. Actually, my data is
> > non-mutable so at present i was putting 0 as the timestamp for all the
> > data. I will definitely try this stuff. Currently, i run bulkloader to
> load
> > the data so i think its gonna be a small change.
> >
> > Yes, i would love to give a try from my side for developing a range based
> > FuzzyRowFilter. However, first i am going to try putting in the
> timestamp.
> >
> > Thanks for a very helpful discussion. Let me know when you create the
> JIRA
> > for range-based FuzzyRowFilter.
> >
> > Thanks,
> > Anil Gupta
> >
> > On Sat, Aug 18, 2012 at 12:13 PM, Alex Baranau  > >wrote:
> >
> > > @Michael,
> > >
> > > This is not a simple partial key scan. Take this example of rows:
> > >
> > > a_11_20120801
> > > a_11_20120802
> > > a_11_20120802
> > > a_11_20120803
> > > a_11_20120804
> > > a_11_20120805
> > > a_12_20120801
> > > a_12_20120802
> > > a_12_20120802
> > > a_12_20120803
> > > a_12_20120804
> > > a_12_20120805
> > >
> > > where a is userId, 1x is actionId and 201208xx is a timestamp.
> If
> > > the query is to select actions in the range 20120803-20120805 (in this
> > case
> > > last 3 days), then when scan encounters row:
> > >
> > > a_11_20120801
> > >
> > > it "knows" it can fast forward scanning to "a_11_20120803", and
> > > skip some records (in practice, this may mean skipping really a LOT of
> > > records).
> > >
> > >
> > > @Anil,
> > >
> > > > Sample Query: I want to get all the event which happened in last
> month.
> > >
> > > 1. What other queries do you do? Just trying to understand why this row
> > key
> > > format was chosen.
> > >
> > > 2. Can you set timestamp on Puts the same as timestamp "assigned" to
> your
> > > record by app logic? If you can, then this is the first thing to try
> and
> > > perform scan with the help of scan.setTimeRange(startTs, stopTs).
> > Depending
> > > on how you write the data this may help a lot with the reading speed by
> > ts,
> > > because that way you may skip the whole HFiles from reading based on
> ts.
> > I
> > > don't know about your data a lot to judge, but:
> > >   * in case you have not a lot of users most of which are with long
> > history
> > > of interaction with you system (i.e. there are a lot of records for
> > > specific "userX_actionY") and
> > >   * if you write data with monotonically increasing timestamp
> > >   * your regions are not too big
> > > then this might help you, as it will increase the chance that some of
> the
> > > HFiles will contain data *all of which* doesn't fall into the time
> > interval
> > > you select by. Otherwise, if written data items with different
> timestamps
> > > are very well spread across the HFiles the chance that some HFiles are
> > > skipped from reading is very small. I believe La
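
For reference, the FuzzyRowFilter being discussed matches fixed vs. non-fixed
row-key positions; it does not yet understand ranges, which is what the new
JIRA asks for. A minimal sketch of the existing behaviour, assuming the filter
API as it exists around the 0.94 line and the fixed-width
userId_actionId_yyyyMMdd layout from the example above:

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyScanSketch {
  public static Scan buildScan() {
    // Match any userId (non-fixed), actionId "11" (fixed), and any day of
    // 2012080x (last position non-fixed). In the mask: 0 = this byte must
    // match the template, 1 = any byte is accepted at this position.
    byte[] rowTemplate = Bytes.toBytes("?_11_2012080?");
    byte[] mask = new byte[] {1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1};

    List<Pair<byte[], byte[]>> fuzzyKeys =
        Arrays.asList(new Pair<byte[], byte[]>(rowTemplate, mask));
    Scan scan = new Scan();
    scan.setFilter(new FuzzyRowFilter(fuzzyKeys));
    return scan;
  }
}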

Re: Slow full-table scans

2012-08-21 Thread J Mohamed Zahoor
Try a quick TestDFSIO to see if things are okay.

./zahoor

On Wed, Aug 22, 2012 at 6:26 AM, Mohit Anchlia wrote:

> It's possible that there is a bad or slower disk on Gurjeet's machine. I
> think details of iostat and cpu would clear things up.
>
> On Tue, Aug 21, 2012 at 4:33 PM, lars hofhansl 
> wrote:
>
> > I get roughly the same (~1.8s) - 100 rows, 200.000 columns, segment size
> > 100
> >
> >
> >
> > 
> >  From: Gurjeet Singh 
> > To: user@hbase.apache.org; lars hofhansl 
> > Sent: Tuesday, August 21, 2012 11:31 AM
> >  Subject: Re: Slow full-table scans
> >
> > How does that compare with the newScanTable on your build ?
> >
> > Gurjeet
> >
> > On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl 
> > wrote:
> > > Hmm... So I tried in HBase (current trunk).
> > > I created 100 rows with 200.000 columns each (using your oldMakeTable).
> > The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo
> > distributed mode - with your oldScanTable).
> > >
> > > -- Lars
> > >
> > >
> > >
> > > - Original Message -
> > > From: lars hofhansl 
> > > To: "user@hbase.apache.org" 
> > > Cc:
> > > Sent: Monday, August 20, 2012 7:50 PM
> > > Subject: Re: Slow full-table scans
> > >
> > > Thanks Gurjeet,
> > >
> > > I'll (hopefully) have a look tomorrow.
> > >
> > > -- Lars
> > >
> > >
> > >
> > > - Original Message -
> > > From: Gurjeet Singh 
> > > To: user@hbase.apache.org; lars hofhansl 
> > > Cc:
> > > Sent: Monday, August 20, 2012 7:42 PM
> > > Subject: Re: Slow full-table scans
> > >
> > > Hi Lars,
> > >
> > > Here is a testcase:
> > >
> > > https://gist.github.com/3410948
> > >
> > > Benchmarking code:
> > >
> > > https://gist.github.com/3410952
> > >
> > > Try running it with numRows = 100, numCols = 20, segmentSize = 1000
> > >
> > > Gurjeet
> > >
> > >
> > > On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh 
> > wrote:
> > >> Sure - I can create a minimal testcase and send it along.
> > >>
> > >> Gurjeet
> > >>
> > >> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl 
> > wrote:
> > >>> That's interesting.
> > >>> Could you share your old and new schema. I would like to track down
> > the performance problems you saw.
> > >>> (If you had a demo program that populates your rows with 200.000
> > columns in a way where you saw the performance issues, that'd be even
> > better, but not necessary).
> > >>>
> > >>>
> > >>> -- Lars
> > >>>
> > >>>
> > >>>
> > >>> 
> > >>>  From: Gurjeet Singh 
> > >>> To: user@hbase.apache.org; lars hofhansl 
> > >>> Sent: Thursday, August 16, 2012 11:26 AM
> > >>> Subject: Re: Slow full-table scans
> > >>>
> > >>> Sorry for the delay guys.
> > >>>
> > >>> Here are a few results:
> > >>>
> > >>> 1. Regions in the table = 11
> > >>> 2. The region servers don't appear to be very busy with the query ~5%
> > >>> CPU (but with parallelization, they are all busy)
> > >>>
> > >>> Finally, I changed the format of my data, such that each cell in
> HBase
> > >>> contains a chunk of a row instead of the single value it had. So,
> > >>> stuffing each Hbase cell with 500 columns of a row, gave me a
> > >>> performance boost of 1000x. It seems that the underlying issue was IO
> > >>> overhead per byte of actual data stored.
> > >>>
> > >>>
> > >>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl 
> > wrote:
> >  Yeah... It looks OK.
> >  Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
> > 
> > 
> >  If you can I'd like to know how busy your regionservers are during
> > these operations. That would be an indication on whether the
> > parallelization is good or not.
> > 
> >  -- Lars
> > 
> > 
> >  - Original Message -
> >  From: Stack 
> >  To: user@hbase.apache.org
> >  Cc:
> >  Sent: Wednesday, August 15, 2012 3:13 PM
> >  Subject: Re: Slow full-table scans
> > 
> >  On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh 
> > wrote:
> > > I am beginning to think that this is a configuration issue on my
> > > cluster. Do the following configuration files seem sane ?
> > >
> > > hbase-env.shhttps://gist.github.com/3345338
> > >
> > 
> >  Nothing wrong w/ this (Remove the -ea, you don't want asserts in
> >  production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
> > 
> > 
> > > hbase-site.xmlhttps://gist.github.com/3345356
> > >
> > 
> >  This is all defaults effectively.   I don't see any of the configs.
> >  recommended by the performance section of the reference guide and/or
> >  those suggested by the GBIF blog.
> > 
> >  You don't answer LarsH's query about where you see the 4%
> difference.
> > 
> >  How many regions in your table?  Whats the HBase Master UI look like
> >  when this scan is running?
> >  St.Ack
> > 
> >
>


NoClassDefFoundError: com.sun.security.auth.NTUserPrincipal

2012-08-21 Thread Ted Yu
Hi,
When using HBase 0.92 jar on IBM JVM, we saw the following exception:

Setting up data table [qa_id1] for environment [qa] if necessary.
Reading from Hbase properties file...
Exception in thread "main" java.lang.NoClassDefFoundError: com.sun.security.auth.NTUserPrincipal
    at org.apache.hadoop.security.UserGroupInformation.(UserGroupInformation.java:310)
    at java.lang.J9VMInternals.initializeImpl(Native Method)
    at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
    at java.lang.reflect.Method.invoke(Method.java:599)
    at org.apache.hadoop.hbase.util.Methods.call(Methods.java:37)
    at org.apache.hadoop.hbase.security.User.call(User.java:586)
    at org.apache.hadoop.hbase.security.User.callStatic(User.java:576)
    at org.apache.hadoop.hbase.security.User.access$400(User.java:50)
    at org.apache.hadoop.hbase.security.User$SecureHadoopUser.(User.java:393)
    at org.apache.hadoop.hbase.security.User$SecureHadoopUser.(User.java:388)
    at org.apache.hadoop.hbase.security.User.getCurrent(User.java:139)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionKey.(HConnectionManager.java:412)
    at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:182)
    at org.apache.hadoop.hbase.client.HBaseAdmin.(HBaseAdmin.java:98)
    at com.ebay.evps.hb.EPSAdminToolTest.createBaseTableIfNeeded(EPSAdminToolTest.java:242)
    at com.ebay.evps.hb.EPSAdminToolTest.setupEnvironments(EPSAdminToolTest.java:167)
    at com.ebay.evps.hb.EPSAdminToolTest.main(EPSAdminToolTest.java:53)
Caused by: java.lang.ClassNotFoundException: com.sun.security.auth.NTUserPrincipal
    at java.net.URLClassLoader.findClass(URLClassLoader.java:419)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:643)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:320)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:609)
    ... 20 more

I wonder if anyone has experienced a similar problem before.

Your feedback would be appreciated.


Re: Using HBase serving to replace memcached

2012-08-21 Thread J Mohamed Zahoor
>
> I could be wrong. I think HFile index block (which is located at the end
> of HFile) is a binary search tree containing all row-key values (of the
> HFile) in the binary search tree. Searching a specific row-key in the
> binary search tree could easily find whether a row-key exists (some node in
> the tree has the same row-key value) or not. Why we need load every block
> to find if the row exists?
>
>
Hmm...
It is a multilevel index. Only the root indexes (Data, Meta, etc.) are loaded
when a region is opened. The rest of the tree (intermediate and leaf
indexes) is kept at the block level.
I am assuming an HFile v2 here for the discussion.
Read this for more clarity: http://hbase.apache.org/book/apes03.html

Nice discussion. You made me read a lot of things. :-)
Now I will dig into the code and check this out.

./Zahoor


Re: Help with parser

2012-08-21 Thread Stack
On Mon, Aug 20, 2012 at 6:02 PM, Harish Krishnan
 wrote:
> I'm trying to write an application that gets the hbase queries from users
> and returns the results.
> I wanted to use the parser class to validate user queries.
>

Users will be using the shell to query hbase?
St.Ack


Re: Slow full-table scans

2012-08-21 Thread Mohit Anchlia
It's possible that there is a bad or slower disk on Gurjeet's machine. I
think details of iostat and cpu would clear things up.

On Tue, Aug 21, 2012 at 4:33 PM, lars hofhansl  wrote:

> I get roughly the same (~1.8s) - 100 rows, 200.000 columns, segment size
> 100
>
>
>
> 
>  From: Gurjeet Singh 
> To: user@hbase.apache.org; lars hofhansl 
> Sent: Tuesday, August 21, 2012 11:31 AM
>  Subject: Re: Slow full-table scans
>
> How does that compare with the newScanTable on your build ?
>
> Gurjeet
>
> On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl 
> wrote:
> > Hmm... So I tried in HBase (current trunk).
> > I created 100 rows with 200.000 columns each (using your oldMakeTable).
> The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo
> distributed mode - with your oldScanTable).
> >
> > -- Lars
> >
> >
> >
> > - Original Message -
> > From: lars hofhansl 
> > To: "user@hbase.apache.org" 
> > Cc:
> > Sent: Monday, August 20, 2012 7:50 PM
> > Subject: Re: Slow full-table scans
> >
> > Thanks Gurjeet,
> >
> > I'll (hopefully) have a look tomorrow.
> >
> > -- Lars
> >
> >
> >
> > - Original Message -
> > From: Gurjeet Singh 
> > To: user@hbase.apache.org; lars hofhansl 
> > Cc:
> > Sent: Monday, August 20, 2012 7:42 PM
> > Subject: Re: Slow full-table scans
> >
> > Hi Lars,
> >
> > Here is a testcase:
> >
> > https://gist.github.com/3410948
> >
> > Benchmarking code:
> >
> > https://gist.github.com/3410952
> >
> > Try running it with numRows = 100, numCols = 20, segmentSize = 1000
> >
> > Gurjeet
> >
> >
> > On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh 
> wrote:
> >> Sure - I can create a minimal testcase and send it along.
> >>
> >> Gurjeet
> >>
> >> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl 
> wrote:
> >>> That's interesting.
> >>> Could you share your old and new schema. I would like to track down
> the performance problems you saw.
> >>> (If you had a demo program that populates your rows with 200.000
> columns in a way where you saw the performance issues, that'd be even
> better, but not necessary).
> >>>
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> 
> >>>  From: Gurjeet Singh 
> >>> To: user@hbase.apache.org; lars hofhansl 
> >>> Sent: Thursday, August 16, 2012 11:26 AM
> >>> Subject: Re: Slow full-table scans
> >>>
> >>> Sorry for the delay guys.
> >>>
> >>> Here are a few results:
> >>>
> >>> 1. Regions in the table = 11
> >>> 2. The region servers don't appear to be very busy with the query ~5%
> >>> CPU (but with parallelization, they are all busy)
> >>>
> >>> Finally, I changed the format of my data, such that each cell in HBase
> >>> contains a chunk of a row instead of the single value it had. So,
> >>> stuffing each Hbase cell with 500 columns of a row, gave me a
> >>> performance boost of 1000x. It seems that the underlying issue was IO
> >>> overhead per byte of actual data stored.
> >>>
> >>>
> >>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl 
> wrote:
>  Yeah... It looks OK.
>  Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
> 
> 
>  If you can I'd like to know how busy your regionservers are during
> these operations. That would be an indication on whether the
> parallelization is good or not.
> 
>  -- Lars
> 
> 
>  - Original Message -
>  From: Stack 
>  To: user@hbase.apache.org
>  Cc:
>  Sent: Wednesday, August 15, 2012 3:13 PM
>  Subject: Re: Slow full-table scans
> 
>  On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh 
> wrote:
> > I am beginning to think that this is a configuration issue on my
> > cluster. Do the following configuration files seem sane ?
> >
> > hbase-env.shhttps://gist.github.com/3345338
> >
> 
>  Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>  production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
> 
> 
> > hbase-site.xmlhttps://gist.github.com/3345356
> >
> 
>  This is all defaults effectively.   I don't see any of the configs.
>  recommended by the performance section of the reference guide and/or
>  those suggested by the GBIF blog.
> 
>  You don't answer LarsH's query about where you see the 4% difference.
> 
>  How many regions in your table?  Whats the HBase Master UI look like
>  when this scan is running?
>  St.Ack
> 
>


Re: HBase Put

2012-08-21 Thread lars hofhansl
That is correct.




 From: "Pamecha, Abhishek" 
To: "user@hbase.apache.org" ; lars hofhansl 
 
Sent: Tuesday, August 21, 2012 4:45 PM
Subject: RE: HBase Put
 
Hi Lars,

Thanks for the explanation. I still have a little doubt:

Based on your description, given gets do a merge sort, the data on disk is not 
kept sorted across files, but just sorted within a file.

So, basically if on two separate days, say these keys get inserted: 

Day1: File1:   A B J M
Day2: File2:  C D K P

Then each file is sorted within itself, but scanning both files will require 
Hbase to use merge sort to produce a sorted result. Right?

Also, File 1 and File2 are immutable, and during compactions, File 1 and File2 
are compacted and sorted using merge sort to a bigger File3. Is that correct 
too?

Thanks,
Abhishek


-Original Message-
From: lars hofhansl [mailto:lhofha...@yahoo.com] 
Sent: Tuesday, August 21, 2012 4:07 PM
To: user@hbase.apache.org
Subject: Re: HBase Put

In a nutshell:
- Puts are collected in memory (in a sorted data structure)
- When the collected data reaches a certain size it is flushed to a new file 
(which is sorted)
- Gets do a merge sort between the various files that have been created
- to contain the number of files they are periodically compacted into fewer, 
larger files


So the data files (HFiles) are immutable once written, changes are batched in 
memory first.

-- Lars




From: "Pamecha, Abhishek" 
To: "user@hbase.apache.org" 
Sent: Tuesday, August 21, 2012 4:00 PM
Subject: HBase Put

Hi

I had a  question on Hbase Put call. In the scenario, where data is inserted 
without any order to column qualifiers, how does Hbase maintain sortedness wrt 
column qualifiers in its store files/blocks?

I checked the code base and I can see 
checks
 being  made for lexicographic insertions for Key value pairs.  But I cant seem 
to find out how the key-offset is calculated in the first place?

Also, given HDFS is by nature, append only, how do randomly ordered keys make 
their way to sorted order. Is it only during minor/major compactions, that this 
sortedness gets applied and that there is a small window during which data is 
not sorted?


Thanks,
Abhishek

RE: HBase Put

2012-08-21 Thread Pamecha, Abhishek
Hi Lars,

Thanks for the explanation. I still have a little doubt:

Based on your description, given that gets do a merge sort, the data on disk is
not kept sorted across files, only within each file.

So, basically, if on two separate days these keys get inserted:

Day1: File1:   A B J M
Day2: File2:  C D K P

then each file is sorted within itself, but scanning both files will require
HBase to use a merge sort to produce a sorted result. Right?

Also, File1 and File2 are immutable, and during compactions File1 and File2
are merge-sorted into a bigger File3. Is that correct too?

Thanks,
Abhishek


-Original Message-
From: lars hofhansl [mailto:lhofha...@yahoo.com] 
Sent: Tuesday, August 21, 2012 4:07 PM
To: user@hbase.apache.org
Subject: Re: HBase Put

In a nutshell:
- Puts are collected in memory (in a sorted data structure)
- When the collected data reaches a certain size it is flushed to a new file 
(which is sorted)
- Gets do a merge sort between the various files that have been created
- to contain the number of files they are periodically compacted into fewer, 
larger files


So the data files (HFiles) are immutable once written, changes are batched in 
memory first.

-- Lars




 From: "Pamecha, Abhishek" 
To: "user@hbase.apache.org" 
Sent: Tuesday, August 21, 2012 4:00 PM
Subject: HBase Put
 
Hi

I had a  question on Hbase Put call. In the scenario, where data is inserted 
without any order to column qualifiers, how does Hbase maintain sortedness wrt 
column qualifiers in its store files/blocks?

I checked the code base and I can see 
checks
 being  made for lexicographic insertions for Key value pairs.  But I cant seem 
to find out how the key-offset is calculated in the first place?

Also, given HDFS is by nature, append only, how do randomly ordered keys make 
their way to sorted order. Is it only during minor/major compactions, that this 
sortedness gets applied and that there is a small window during which data is 
not sorted?


Thanks,
Abhishek


Re: Slow full-table scans

2012-08-21 Thread lars hofhansl
I get roughly the same (~1.8s) - 100 rows, 200.000 columns, segment size 100




 From: Gurjeet Singh 
To: user@hbase.apache.org; lars hofhansl  
Sent: Tuesday, August 21, 2012 11:31 AM
Subject: Re: Slow full-table scans
 
How does that compare with the newScanTable on your build ?

Gurjeet

On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl  wrote:
> Hmm... So I tried in HBase (current trunk).
> I created 100 rows with 200.000 columns each (using your oldMakeTable). The 
> creation took a bit, but scanning finished in 1.8s. (HBase in pseudo 
> distributed mode - with your oldScanTable).
>
> -- Lars
>
>
>
> - Original Message -
> From: lars hofhansl 
> To: "user@hbase.apache.org" 
> Cc:
> Sent: Monday, August 20, 2012 7:50 PM
> Subject: Re: Slow full-table scans
>
> Thanks Gurjeet,
>
> I'll (hopefully) have a look tomorrow.
>
> -- Lars
>
>
>
> - Original Message -
> From: Gurjeet Singh 
> To: user@hbase.apache.org; lars hofhansl 
> Cc:
> Sent: Monday, August 20, 2012 7:42 PM
> Subject: Re: Slow full-table scans
>
> Hi Lars,
>
> Here is a testcase:
>
> https://gist.github.com/3410948
>
> Benchmarking code:
>
> https://gist.github.com/3410952
>
> Try running it with numRows = 100, numCols = 20, segmentSize = 1000
>
> Gurjeet
>
>
> On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh  wrote:
>> Sure - I can create a minimal testcase and send it along.
>>
>> Gurjeet
>>
>> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl  wrote:
>>> That's interesting.
>>> Could you share your old and new schema. I would like to track down the 
>>> performance problems you saw.
>>> (If you had a demo program that populates your rows with 200.000 columns in 
>>> a way where you saw the performance issues, that'd be even better, but not 
>>> necessary).
>>>
>>>
>>> -- Lars
>>>
>>>
>>>
>>> 
>>>  From: Gurjeet Singh 
>>> To: user@hbase.apache.org; lars hofhansl 
>>> Sent: Thursday, August 16, 2012 11:26 AM
>>> Subject: Re: Slow full-table scans
>>>
>>> Sorry for the delay guys.
>>>
>>> Here are a few results:
>>>
>>> 1. Regions in the table = 11
>>> 2. The region servers don't appear to be very busy with the query ~5%
>>> CPU (but with parallelization, they are all busy)
>>>
>>> Finally, I changed the format of my data, such that each cell in HBase
>>> contains a chunk of a row instead of the single value it had. So,
>>> stuffing each Hbase cell with 500 columns of a row, gave me a
>>> performance boost of 1000x. It seems that the underlying issue was IO
>>> overhead per byte of actual data stored.
>>>
>>>
>>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl  wrote:
 Yeah... It looks OK.
 Maybe 2G of heap is a bit low when dealing with 200.000 column rows.


 If you can I'd like to know how busy your regionservers are during these 
 operations. That would be an indication on whether the parallelization is 
 good or not.

 -- Lars


 - Original Message -
 From: Stack 
 To: user@hbase.apache.org
 Cc:
 Sent: Wednesday, August 15, 2012 3:13 PM
 Subject: Re: Slow full-table scans

 On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh  wrote:
> I am beginning to think that this is a configuration issue on my
> cluster. Do the following configuration files seem sane ?
>
> hbase-env.sh    https://gist.github.com/3345338
>

 Nothing wrong w/ this (Remove the -ea, you don't want asserts in
 production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).


> hbase-site.xml    https://gist.github.com/3345356
>

 This is all defaults effectively.   I don't see any of the configs.
 recommended by the performance section of the reference guide and/or
 those suggested by the GBIF blog.

 You don't answer LarsH's query about where you see the 4% difference.

 How many regions in your table?  Whats the HBase Master UI look like
 when this scan is running?
 St.Ack


Re: HBase Put

2012-08-21 Thread lars hofhansl
In a nutshell:
- Puts are collected in memory (in a sorted data structure)
- When the collected data reaches a certain size it is flushed to a new file
(which is sorted)
- Gets do a merge sort across the various files that have been created
- To keep the number of files in check, they are periodically compacted into
fewer, larger files


So the data files (HFiles) are immutable once written; changes are batched in
memory first.

-- Lars
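
A minimal illustration of the first two bullets above, using plain JDK
collections as an analogy rather than HBase's actual MemStore code: inserts
land in a sorted in-memory structure regardless of arrival order, so each
flushed file comes out already sorted.

import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class MemstoreAnalogy {
  public static void main(String[] args) {
    // Keys arrive in arbitrary order...
    ConcurrentSkipListMap<String, String> memstore = new ConcurrentSkipListMap<String, String>();
    memstore.put("J", "v3");
    memstore.put("A", "v1");
    memstore.put("M", "v4");
    memstore.put("B", "v2");

    // ...but iteration (and hence a "flush" to a new file) is always in sorted order.
    for (Map.Entry<String, String> e : memstore.entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
    // Prints A, B, J, M. Each flushed file is internally sorted; a later read
    // merge-sorts across files, and compaction rewrites them into one sorted file.
  }
}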




 From: "Pamecha, Abhishek" 
To: "user@hbase.apache.org"  
Sent: Tuesday, August 21, 2012 4:00 PM
Subject: HBase Put
 
Hi

I had a  question on Hbase Put call. In the scenario, where data is inserted 
without any order to column qualifiers, how does Hbase maintain sortedness wrt 
column qualifiers in its store files/blocks?

I checked the code base and I can see 
checks
 being  made for lexicographic insertions for Key value pairs.  But I cant seem 
to find out how the
key-offset is calculated in the first place?

Also, given HDFS is by nature, append only, how do randomly ordered keys make 
their way to sorted order. Is it only during minor/major compactions, that this 
sortedness gets applied and that there is a small window during which data is 
not sorted?


Thanks,
Abhishek

HBase Put

2012-08-21 Thread Pamecha, Abhishek
Hi

I had a question on the HBase Put call. In the scenario where data is inserted
without any order to column qualifiers, how does HBase maintain sortedness w.r.t.
column qualifiers in its store files/blocks?

I checked the code base and I can see checks being made for lexicographic
insertion of KeyValue pairs, but I can't seem to find out how the
key offset is calculated in the first place.

Also, given that HDFS is append-only by nature, how do randomly ordered keys make
their way into sorted order? Is it only during minor/major compactions that this
sortedness gets applied, and is there a small window during which data is
not sorted?


Thanks,
Abhishek



Re: Supervisord

2012-08-21 Thread Stack
On Tue, Aug 21, 2012 at 2:57 PM, Marco Gallotta  wrote:
> Is it possible to run the hbase processes in the foreground so that they can 
> be run and monitored by supervisord?
>

Try defining HBASE_NOEXEC when you run it.
St.Ack


Supervisord

2012-08-21 Thread Marco Gallotta
Is it possible to run the hbase processes in the foreground so that they can be 
run and monitored by supervisord? 

-- 
Marco Gallotta | Mountain View, California
Software Engineer, Infrastructure | Loki Studios
fb.me/marco.gallotta | twitter.com/marcog
ma...@gallotta.co.za | +1 (650) 417-3313

Sent with Sparrow (http://www.sparrowmailapp.com/?sig)




Re: Slow full-table scans

2012-08-21 Thread Gurjeet Singh
How does that compare with the newScanTable on your build ?

Gurjeet

On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl  wrote:
> Hmm... So I tried in HBase (current trunk).
> I created 100 rows with 200.000 columns each (using your oldMakeTable). The 
> creation took a bit, but scanning finished in 1.8s. (HBase in pseudo 
> distributed mode - with your oldScanTable).
>
> -- Lars
>
>
>
> - Original Message -
> From: lars hofhansl 
> To: "user@hbase.apache.org" 
> Cc:
> Sent: Monday, August 20, 2012 7:50 PM
> Subject: Re: Slow full-table scans
>
> Thanks Gurjeet,
>
> I'll (hopefully) have a look tomorrow.
>
> -- Lars
>
>
>
> - Original Message -
> From: Gurjeet Singh 
> To: user@hbase.apache.org; lars hofhansl 
> Cc:
> Sent: Monday, August 20, 2012 7:42 PM
> Subject: Re: Slow full-table scans
>
> Hi Lars,
>
> Here is a testcase:
>
> https://gist.github.com/3410948
>
> Benchmarking code:
>
> https://gist.github.com/3410952
>
> Try running it with numRows = 100, numCols = 20, segmentSize = 1000
>
> Gurjeet
>
>
> On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh  wrote:
>> Sure - I can create a minimal testcase and send it along.
>>
>> Gurjeet
>>
>> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl  wrote:
>>> That's interesting.
>>> Could you share your old and new schema. I would like to track down the 
>>> performance problems you saw.
>>> (If you had a demo program that populates your rows with 200.000 columns in 
>>> a way where you saw the performance issues, that'd be even better, but not 
>>> necessary).
>>>
>>>
>>> -- Lars
>>>
>>>
>>>
>>> 
>>>  From: Gurjeet Singh 
>>> To: user@hbase.apache.org; lars hofhansl 
>>> Sent: Thursday, August 16, 2012 11:26 AM
>>> Subject: Re: Slow full-table scans
>>>
>>> Sorry for the delay guys.
>>>
>>> Here are a few results:
>>>
>>> 1. Regions in the table = 11
>>> 2. The region servers don't appear to be very busy with the query ~5%
>>> CPU (but with parallelization, they are all busy)
>>>
>>> Finally, I changed the format of my data, such that each cell in HBase
>>> contains a chunk of a row instead of the single value it had. So,
>>> stuffing each Hbase cell with 500 columns of a row, gave me a
>>> performance boost of 1000x. It seems that the underlying issue was IO
>>> overhead per byte of actual data stored.
>>>
>>>
>>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl  wrote:
 Yeah... It looks OK.
 Maybe 2G of heap is a bit low when dealing with 200.000 column rows.


 If you can I'd like to know how busy your regionservers are during these 
 operations. That would be an indication on whether the parallelization is 
 good or not.

 -- Lars


 - Original Message -
 From: Stack 
 To: user@hbase.apache.org
 Cc:
 Sent: Wednesday, August 15, 2012 3:13 PM
 Subject: Re: Slow full-table scans

 On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh  wrote:
> I am beginning to think that this is a configuration issue on my
> cluster. Do the following configuration files seem sane ?
>
> hbase-env.shhttps://gist.github.com/3345338
>

 Nothing wrong w/ this (Remove the -ea, you don't want asserts in
 production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).


> hbase-site.xmlhttps://gist.github.com/3345356
>

 This is all defaults effectively.   I don't see any of the configs.
 recommended by the performance section of the reference guide and/or
 those suggested by the GBIF blog.

 You don't answer LarsH's query about where you see the 4% difference.

 How many regions in your table?  Whats the HBase Master UI look like
 when this scan is running?
 St.Ack



Re: Slow full-table scans

2012-08-21 Thread lars hofhansl
Hmm... So I tried in HBase (current trunk).
I created 100 rows with 200.000 columns each (using your oldMakeTable). The 
creation took a bit, but scanning finished in 1.8s. (HBase in pseudo 
distributed mode - with your oldScanTable).

-- Lars



- Original Message -
From: lars hofhansl 
To: "user@hbase.apache.org" 
Cc: 
Sent: Monday, August 20, 2012 7:50 PM
Subject: Re: Slow full-table scans

Thanks Gurjeet,

I'll (hopefully) have a look tomorrow.

-- Lars



- Original Message -
From: Gurjeet Singh 
To: user@hbase.apache.org; lars hofhansl 
Cc: 
Sent: Monday, August 20, 2012 7:42 PM
Subject: Re: Slow full-table scans

Hi Lars,

Here is a testcase:

https://gist.github.com/3410948

Benchmarking code:

https://gist.github.com/3410952

Try running it with numRows = 100, numCols = 20, segmentSize = 1000

Gurjeet


On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh  wrote:
> Sure - I can create a minimal testcase and send it along.
>
> Gurjeet
>
> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl  wrote:
>> That's interesting.
>> Could you share your old and new schema. I would like to track down the 
>> performance problems you saw.
>> (If you had a demo program that populates your rows with 200.000 columns in 
>> a way where you saw the performance issues, that'd be even better, but not 
>> necessary).
>>
>>
>> -- Lars
>>
>>
>>
>> 
>>  From: Gurjeet Singh 
>> To: user@hbase.apache.org; lars hofhansl 
>> Sent: Thursday, August 16, 2012 11:26 AM
>> Subject: Re: Slow full-table scans
>>
>> Sorry for the delay guys.
>>
>> Here are a few results:
>>
>> 1. Regions in the table = 11
>> 2. The region servers don't appear to be very busy with the query ~5%
>> CPU (but with parallelization, they are all busy)
>>
>> Finally, I changed the format of my data, such that each cell in HBase
>> contains a chunk of a row instead of the single value it had. So,
>> stuffing each Hbase cell with 500 columns of a row, gave me a
>> performance boost of 1000x. It seems that the underlying issue was IO
>> overhead per byte of actual data stored.
>>
>>
>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl  wrote:
>>> Yeah... It looks OK.
>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>>>
>>>
>>> If you can I'd like to know how busy your regionservers are during these 
>>> operations. That would be an indication on whether the parallelization is 
>>> good or not.
>>>
>>> -- Lars
>>>
>>>
>>> - Original Message -
>>> From: Stack 
>>> To: user@hbase.apache.org
>>> Cc:
>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>> Subject: Re: Slow full-table scans
>>>
>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh  wrote:
 I am beginning to think that this is a configuration issue on my
 cluster. Do the following configuration files seem sane ?

 hbase-env.sh    https://gist.github.com/3345338

>>>
>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>>
>>>
 hbase-site.xml    https://gist.github.com/3345356

>>>
>>> This is all defaults effectively.   I don't see any of the configs.
>>> recommended by the performance section of the reference guide and/or
>>> those suggested by the GBIF blog.
>>>
>>> You don't answer LarsH's query about where you see the 4% difference.
>>>
>>> How many regions in your table?  Whats the HBase Master UI look like
>>> when this scan is running?
>>> St.Ack
>>>


Re: issues copying data from one table to another

2012-08-21 Thread Norbert Burger
On Sat, Aug 18, 2012 at 7:14 AM, Michael Segel
 wrote:

Thanks.

> Just out of curiosity, what would happen if you could disable the table, 
> alter the table's max file size and then attempted to merge regions?  Note: 
> I've never tried this, don't know if its possible, just thinking outside of 
> the box...

Good idea.  In this case, I'm free to disable the old, region-full
table.  Unfortunately, I've already started writing data into the
newer, lower-region-count table, so at some point I'll need to export
the data anyway.

Does it make sense that these perf issues are caused by using the
HBase client API (vs. bulk export)?  My next thought was to write a
custom mapper for importtsv, as Anil suggested.

Norbert


Re: issues copying data from one table to another

2012-08-21 Thread Norbert Burger
On Fri, Aug 17, 2012 at 4:09 PM, anil gupta  wrote:
> If you want to customize the bulkloader then you can write your own mapper
> to define the business logic for loading. You need to specify the mapper at
> the time of running importsv by using:

Thanks, Anil. I had seen that section of the HBase book, but
glossed over the mapper class property until you pointed it out.

Norbert


Re: HDFS-918 satus? Use single Selector and small thread pool to replace many instances of BlockSender for reads

2012-08-21 Thread Jean-Daniel Cryans
AFAIK Jay is doing other stuff, maybe Todd is interested in picking it
up but he's pretty busy too.

Feel free to apply it locally :)

J-D

On Mon, Aug 20, 2012 at 9:18 PM, jlei liu  wrote:
>  I was wondering if there was any movement on any of these HDFS tickets for
> HBase.
> HDFS-918 is the ticket, but the last comment was by Todd Lipcon in June
> 2011.
>
> I think this is a good idea that can improve the pread performance of HBase.
>
> I want to apply the patch in hadoop-0.20.2-cdh3u5 version, can  I do it ?
>
>
> Thanks,
>
> LiuLei


Re: When I use secure hbase client create table, throws accessDeniedException 'user is null'

2012-08-21 Thread Andrew Purtell
What version of HBase?

You have this in your client and configurations?


<property>
  <name>hbase.rpc.engine</name>
  <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value>
</property>

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>

<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>

Because "user null" usually means you haven't configured use of the
SecureRpcEngine.

An example working secure configuration:
https://github.com/apurtell/tm-ec2-demo


On Tue, Aug 21, 2012 at 4:43 AM, Pan,Jinyu  wrote:

> When I use access secure hbase, the client throws AccessDeniedException
> 'Insufficient permissions for user 'null' (global, action=CREATE)'
>
> Why? How to avoid it?
>
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)


Re: Using HBase serving to replace memcached

2012-08-21 Thread Lin Ma
Thanks Zahoor,

> If there is no bloom... you have to load every block and scan to find if
the row exists..

I could be wrong. I think the HFile index block (which is located at the end
of the HFile) is a binary search tree containing all the row-key values of the
HFile. Searching for a specific row-key in that tree could easily determine
whether the row-key exists (some node in the tree has the same row-key value)
or not. Why do we need to load every block to find out if the row exists?

regards,
Lin

On Tue, Aug 21, 2012 at 11:56 PM, jmozah  wrote:

> >
> >
> > 1. After reading the materials you sent to me, I am confused how Bloom
> Filter could save I/O during random read. Supposing I am not using Bloom
> Filter, in order to find whether a row (or row-key) exists, we need to scan
> the index block which is at the end part of an HFile, the scan is in memory
> (I think index block is always in memory, please feel free to correct me if
> I am wrong) using binary search -- it should be pretty fast. With Bloom
> Filter, we could be a bit faster by looking up Bloom Filter bit vector in
> memory. Since both index block binary search and Bloom Filter bit vector
> search are doing in memory (no I/O is involved), what kinds of I/O is
> saved? :-)
> >
>
> If bloom says the Row *may* be present.. the block is loaded otherwise
> not...
> If there is no bloom... you have to load every block and scan to find if
> the row exists..
>
> This may incur more IO
>
>
> > 2.
> >
> > > One Hadoop job doing random reads is perfectly fine.  but , since you
> said "Handling directly user traffic"... i assumed you wanted to
> > > expose HBase independently to every client request, thereby having as
> many connections as the number of simultaneous req..
> >
> > Sorry I need to confirm again on this point. I think you mean
> establishing a new connection for each request is not good, using
> connection pool or asynchronous I/O is preferred?
> >
>
>
> Yes.


Re: Thrift2 interface

2012-08-21 Thread Stack
On Mon, Aug 20, 2012 at 6:18 PM, Joe Pallas  wrote:
> Anyone out there actively using the thrift2 interface in 0.94?  Thrift 
> bindings for C++ don’t seem to handle optional arguments too well (that is to 
> say, it seems that optional arguments are not optional).  Unfortunately, 
> checkAndPut uses an optional argument for value to distinguish between the 
> two cases (value must match vs no cell with that column qualifier).  Any 
> clues on how to work around that difficulty would be welcome.
>

If you make a patch, we'll commit it Joe.

Have you seen this?
https://github.com/facebook/native-cpp-hbase-client  Would it help?

St.Ack


Re: Using HBase serving to replace memcached

2012-08-21 Thread jmozah
> 
> 
> 1. After reading the materials you sent to me, I am confused how Bloom Filter 
> could save I/O during random read. Supposing I am not using Bloom Filter, in 
> order to find whether a row (or row-key) exists, we need to scan the index 
> block which is at the end part of an HFile, the scan is in memory (I think 
> index block is always in memory, please feel free to correct me if I am 
> wrong) using binary search -- it should be pretty fast. With Bloom Filter, we 
> could be a bit faster by looking up Bloom Filter bit vector in memory. Since 
> both index block binary search and Bloom Filter bit vector search are doing 
> in memory (no I/O is involved), what kinds of I/O is saved? :-)
> 

If bloom says the Row *may* be present.. the block is loaded otherwise not...
If there is no bloom... you have to load every block and scan to find if the 
row exists..

This may incur more IO 


> 2. 
> 
> > One Hadoop job doing random reads is perfectly fine.  but , since you said 
> > "Handling directly user traffic"... i assumed you wanted to
> > expose HBase independently to every client request, thereby having as many 
> > connections as the number of simultaneous req..
> 
> Sorry I need to confirm again on this point. I think you mean establishing a 
> new connection for each request is not good, using connection pool or 
> asynchronous I/O is preferred?
> 


Yes.

Re: Using HBase serving to replace memcached

2012-08-21 Thread Lin Ma
Thank you Zahoor,

Two more comments,

1. After reading the materials you sent me, I am confused about how a Bloom
filter could save I/O during random reads. Supposing I am not using a Bloom
filter: in order to find whether a row (or row key) exists, we need to search
the index block, which is at the end part of an HFile; the search is done in
memory (I think the index block is always in memory, please feel free to
correct me if I am wrong) using binary search, so it should be pretty fast.
With a Bloom filter, we could be a bit faster by looking up the Bloom filter
bit vector in memory. Since both the index-block binary search and the Bloom
filter bit-vector lookup are done in memory (no I/O is involved), what kind of
I/O is saved? :-)

2.

> One Hadoop job doing random reads is perfectly fine.  but , since you
said "Handling directly user traffic"... i assumed you wanted to
> expose HBase independently to every client request, thereby having as
many connections as the number of simultaneous req..

Sorry I need to confirm again on this point. I think you mean establishing
a new connection for each request is not good, using connection pool or
asynchronous I/O is preferred?

regards,
Lin

On Tue, Aug 21, 2012 at 10:45 PM, jmozah  wrote:

> >
> >
> >
> > 1. I know very basics of Bloom filters, which is used for detect whether
> an item is in a set. How to use Bloom filters in HBase to improve random
> read performance? Could you show me an example? Thanks.
>
> This will help omit loading the blocks (thereby saving IO and cache churn)
> which does not have the given row.
> For more on bloom, see
> 1 -
> https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf
> 2 - http://www.quora.com/How-are-bloom-filters-used-in-HBase
>
>
> > 2. "Also more client connections is one more issue that might infest
> you" -- supposing I am doing random read from a Hadoop job to access HBase,
> do you mean using multiple client connections from the Hadoop job is good
> or not good? Sorry I am a bit lost. :-)
>
> One Hadoop job doing random reads is perfectly fine.  but , since you said
> "Handling directly user traffic"... i assumed you wanted to expose HBase
> independently to every client request, thereby having as many connections
> as the number of simultaneous req..
>
>
> > 3. "asynchbase will help you" -- does HBase support asynchronous API?
> Sorry I cannot find it out. Appreciate if you could point me the APIs you
> are referring to.
>
>
> Not the default HTable API.  asynchbase is another client for Hbase. read
> more about asynchbase here (https://github.com/stumbleupon/asynchbase)
>
>


Re: What happened in hlog if data are deleted cuased by ttl?

2012-08-21 Thread jmozah
This helped me http://hadoop-hbase.blogspot.in/2011/12/deletion-in-hbase.html


./Zahoor
HBase Musings
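
For context, a minimal sketch of where the TTL under discussion is configured.
This assumes the 0.92-era admin API; the table name 'test' and family 'course'
with ttl=5 come from the original question quoted below. The expiry itself is
enforced by the read path and compactions, as described in the quoted reply.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TtlTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HColumnDescriptor family = new HColumnDescriptor("course");
    family.setTimeToLive(5); // seconds; expired cells are skipped on read and
                             // physically dropped when the store is compacted

    HTableDescriptor table = new HTableDescriptor("test");
    table.addFamily(family);
    admin.createTable(table);
  }
}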


On 14-Aug-2012, at 6:54 PM, Harsh J  wrote:

> Hi Yonghu,
> 
> A timestamp is stored along with each insert. The ttl is maintained at
> the region-store level. Hence, when the log replays, all entries with
> expired TTLs are automatically omitted.
> 
> Also, TTL deletions happen during compactions, and hence do not
> carry/need Delete events. When scanning a store file, TTL-expired
> entries are automatically skipped away.
> 
> On Tue, Aug 14, 2012 at 3:34 PM, yonghu  wrote:
>> My hbase version is 0.92. I tried something as follows:
>> 1.Created a table 'test' with 'course' in which ttl=5.
>> 2. inserted one row into the table. 5 seconds later, the row was deleted.
>> Later when I checked the log infor of 'test' table, I only found the
>> inserted information but not deleted information.
>> 
>> Can anyone tell me which information is written into hlog when data is
>> deleted by ttl or in this situation, no information is written into
>> the hlog. If there is no information of deletion in the log, how can
>> we guarantee the data recovered by log are correct?
>> 
>> Thanks!
>> 
>> Yong
> 
> 
> 
> -- 
> Harsh J



Thrift2 interface

2012-08-21 Thread Joe Pallas
Anyone out there actively using the thrift2 interface in 0.94?  Thrift bindings 
for C++ don’t seem to handle optional arguments too well (that is to say, it 
seems that optional arguments are not optional).  Unfortunately, checkAndPut 
uses an optional argument for value to distinguish between the two cases (value 
must match vs no cell with that column qualifier).  Any clues on how to work 
around that difficulty would be welcome.

Thanks.
joe



Re: Substring comparator for column key

2012-08-21 Thread jmozah
For filtering based on the column key (I hope that's what you asked), there
is no direct substring filter as far as I know.
But I think you can use ColumnPrefixFilter, which returns only those columns
whose qualifier starts with a given prefix (a prefix match rather than a true
substring or regex match).
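
A minimal sketch of that suggestion, assuming the standard filter API; the
family name and prefix are made up:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnPrefixSketch {
  public static Scan buildScan() {
    // Return only columns whose qualifier starts with "addr_", e.g. addr_city, addr_zip.
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));
    scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("addr_")));
    return scan;
  }
}

If a true substring match on the qualifier is needed, a QualifierFilter
combined with a SubstringComparator may also be worth a look.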


./Zahoor
HBase Musings


On 21-Aug-2012, at 3:27 PM, Shagun Agarwal  wrote:

> Hi,
> 
> There is SubstringComparator which can be used with SingleColumnValueFilter 
> for substring filter however this works for key value. Is there any way to do 
> a substring filtering for column key?
> 
> Thanks
> Shagun



Re: Using HBase serving to replace memcached

2012-08-21 Thread jmozah
> 
> 
> 
> 1. I know very basics of Bloom filters, which is used for detect whether an 
> item is in a set. How to use Bloom filters in HBase to improve random read 
> performance? Could you show me an example? Thanks.

This will help skip loading blocks (thereby saving IO and cache churn)
that do not contain the given row.
For more on bloom, see 
1 - 
https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf
2 - http://www.quora.com/How-are-bloom-filters-used-in-HBase
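
As a concrete example, assuming the 0.92/0.94 client API (table and family
names are made up), a row-level bloom is enabled per column family like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.StoreFile;

public class BloomFilterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HColumnDescriptor cf = new HColumnDescriptor("d");
    // ROW blooms answer "might this store file contain this row key?", so files
    // that definitely do not contain the row are never read; ROWCOL does the
    // same check per row+column.
    cf.setBloomFilterType(StoreFile.BloomType.ROW);

    HTableDescriptor table = new HTableDescriptor("user_profiles");
    table.addFamily(cf);
    admin.createTable(table);
  }
}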


> 2. "Also more client connections is one more issue that might infest you" -- 
> supposing I am doing random read from a Hadoop job to access HBase, do you 
> mean using multiple client connections from the Hadoop job is good or not 
> good? Sorry I am a bit lost. :-)

One Hadoop job doing random reads is perfectly fine. But since you said
"Handling directly user traffic"... I assumed you wanted to expose HBase
independently to every client request, thereby having as many connections as
the number of simultaneous requests.
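
One sketch of the pooling idea, assuming the HTablePool class shipped with the
0.92-era client (where close() returns the pooled table rather than tearing it
down); the pool size and table name are made up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Result;

public class PooledReadSketch {
  private static final Configuration CONF = HBaseConfiguration.create();
  // Share one pool (and the underlying connection/ZooKeeper session) across all
  // request-handling threads instead of creating an HTable per user request.
  private static final HTablePool POOL = new HTablePool(CONF, 100);

  public static Result lookup(byte[] row) throws IOException {
    HTableInterface table = POOL.getTable("user_profiles"); // hypothetical table
    try {
      return table.get(new Get(row));
    } finally {
      table.close(); // returns the table to the pool
    }
  }
}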


> 3. "asynchbase will help you" -- does HBase support asynchronous API? Sorry I 
> cannot find it out. Appreciate if you could point me the APIs you are 
> referring to.


Not the default HTable API. asynchbase is another client for HBase; read more
about asynchbase here (https://github.com/stumbleupon/asynchbase).
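
And a minimal asynchbase sketch, assuming the asynchbase 1.x API from the link
above; the ZooKeeper quorum spec, table and row key are made up:

import java.util.ArrayList;
import org.hbase.async.GetRequest;
import org.hbase.async.HBaseClient;
import org.hbase.async.KeyValue;

public class AsynchbaseSketch {
  public static void main(String[] args) throws Exception {
    // One client instance is shared by the whole application; it multiplexes
    // requests over a small number of connections instead of one per caller.
    HBaseClient client = new HBaseClient("zk-host1,zk-host2");

    GetRequest get = new GetRequest("user_profiles", "row-123");
    // join() blocks only for this demo; normally a callback is attached instead.
    ArrayList<KeyValue> row = client.get(get).join();
    System.out.println("cells: " + row.size());

    client.shutdown().join();
  }
}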



Re: What happened in hlog if data are deleted cuased by ttl?

2012-08-21 Thread yonghu
Thanks for your response. Can you tell me how the data is deleted due
to the TTL? Which module in HBase triggers the deletion? You mentioned
the scanner; does that mean the scanner scans the store files
periodically and then deletes the data that has expired?

regards!

Yong

On Thu, Aug 16, 2012 at 6:16 AM, Ramkrishna.S.Vasudevan
 wrote:
> Hi
>
> Just to add on,  The HLog is just an edit log.  Any transaction updates(
> Puts/Deletes) are just added to HLog.  It is the Scanner that takes care of
> the TTL part which is calculated from the TTL configured at the column
> family(Store) level.
>
> Regards
> Ram
>
>> -Original Message-
>> From: Harsh J [mailto:ha...@cloudera.com]
>> Sent: Tuesday, August 14, 2012 8:51 PM
>> To: user@hbase.apache.org
>> Subject: Re: What happened in hlog if data are deleted cuased by ttl?
>>
>> Yes, TTL deletions are done only during compactions. They aren't
>> "Deleted" in the sense of what a Delete insert signifies, but are
>> rather eliminated in the write process when new
>> storefiles are written out - if the value being written to the
>> compacted store has already expired.
>>
>> On Tue, Aug 14, 2012 at 8:40 PM, yonghu  wrote:
>> > Hi Hars,
>> >
>> > Thanks for your reply. If I understand you right, it means the ttl
>> > deletion will not reflect in log.
>> >
>> > On Tue, Aug 14, 2012 at 3:24 PM, Harsh J  wrote:
>> >> Hi Yonghu,
>> >>
>> >> A timestamp is stored along with each insert. The ttl is maintained
>> at
>> >> the region-store level. Hence, when the log replays, all entries
>> with
>> >> expired TTLs are automatically omitted.
>> >>
>> >> Also, TTL deletions happen during compactions, and hence do not
>> >> carry/need Delete events. When scanning a store file, TTL-expired
>> >> entries are automatically skipped away.
>> >>
>> >> On Tue, Aug 14, 2012 at 3:34 PM, yonghu 
>> wrote:
>> >>> My hbase version is 0.92. I tried something as follows:
>> >>> 1.Created a table 'test' with 'course' in which ttl=5.
>> >>> 2. inserted one row into the table. 5 seconds later, the row was
>> deleted.
>> >>> Later when I checked the log infor of 'test' table, I only found
>> the
>> >>> inserted information but not deleted information.
>> >>>
>> >>> Can anyone tell me which information is written into hlog when data
>> is
>> >>> deleted by ttl or in this situation, no information is written into
>> >>> the hlog. If there is no information of deletion in the log, how
>> can
>> >>> we guarantee the data recovered by log are correct?
>> >>>
>> >>> Thanks!
>> >>>
>> >>> Yong
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>>
>>
>>
>> --
>> Harsh J
>


Re: Using HBase serving to replace memcached

2012-08-21 Thread Lin Ma
Thanks for the reply, Zahoor.

Some more comments,

1. I know the very basics of Bloom filters, which are used to detect whether an
item is in a set. How are Bloom filters used in HBase to improve random read
performance? Could you show me an example? Thanks.
2. "Also more client connections is one more issue that might infest you"
-- supposing I am doing random reads from a Hadoop job to access HBase, do
you mean that using multiple client connections from the Hadoop job is good or
not good? Sorry, I am a bit lost. :-)
3. "asynchbase will help you" -- does HBase support an asynchronous API? Sorry,
I cannot find it. I'd appreciate it if you could point me to the APIs you are
referring to.

regards,
Lin

On Tue, Aug 21, 2012 at 6:55 PM, J Mohamed Zahoor  wrote:

> Again. if your data is so huge that it is much larger than the available
> RAM, you might want to rethink.
> There are some configs in HBase that will help you in random read
> scenarios... like Bloom filters etc.
> Also more client connections is one more issue that might infest you...
> where connection pooling or asynchbase will help you.
>
> ./Zahoor
>
>
> On Tue, Aug 21, 2012 at 12:56 AM, Asif Ali  wrote:
>
> > I've used memcached heavily in such scenarios and all such data is always
> > in Memory.
> >
> > Memcached definitely is a great solution for this and scales very well.
> But
> > keep in mind - it is not consistent. Which means there are some requests
> > which will be handled incorrectly.
> >
> > Memcached is great but also look at Guava cache for similar use cases.
> >
> > Asif Ali
> >
> >
> > On Mon, Aug 20, 2012 at 9:09 AM, Lin Ma  wrote:
> >
> > > Thank you Drew. I like your reply, especially blocking cache nature
> > > provided by HBase. A quick question, for traditional memcached, all of
> > the
> > > items are in memory, no disk is used, correct?
> > >
> > > regards,
> > > Lin
> > >
> > > On Mon, Aug 20, 2012 at 9:26 PM, Drew Dahlke 
> > > wrote:
> > >
> > > > I'd say if the memcached model is working for you, stick with it.
> > > > HBase (currently) caches whole blocks. With cache blocks enabled you
> > > > can achieve 10s of thousands of reqs/sec with a pretty small cluster.
> > > > However there's a catch. Once you reach the point where your tables
> > > > are so large they can't all sit in memory at the same time you'll see
> > > > a behavior change. User traffic tends to be very random access which,
> > > > with block caching, can cause a lot of thrashing with frequent cache
> > > > evictions. We've seen this bring our cluster to it's knees.
> > > >
> > > > IMHO a better model is persist things in HBase and then cache things
> > > > with memcached just as you would with any other data store. If you're
> > > > looking for a spiffy memcached replacement I'd recommend checking out
> > > > Redis.
> > > >
> > > >
> > > > On Sat, Aug 18, 2012 at 3:12 AM, Lin Ma  wrote:
> > > > > Hello guys,
> > > > >
> > > > > In your experience, is it practical to use HBase directly for
> > serving?
> > > > > Saying handle directly user traffic (tens of thousands QPS scale)
> > > behind
> > > > > Apache, and replace the role of memcached? I am not sure whether
> > there
> > > > are
> > > > > any known panic to replace memcached by using HBase? One issue I
> > could
> > > > > think about is for a specific row range, only one active region
> > server
> > > > > could handle the request, but in memcached, we can setup several
> > > > memcached
> > > > > instance with duplicate content (all of them are active) to serve
> the
> > > > same
> > > > > purpose under a VIP which could achieve better performance and
> > > > scalability.
> > > > >
> > > > > Any advice or reference documents are appreciated. Thanks.
> > > > >
> > > > > regards,
> > > > > Lin
> > > >
> > >
> >
>


Re: Using HBase serving to replace memcached

2012-08-21 Thread Lin Ma
Thanks Asif,

Regarding your comment, "Which means there are some requests which will be
handled incorrectly", could you show me an example of what you mean by
"handled incorrectly"?

regards,
Lin

On Tue, Aug 21, 2012 at 3:26 AM, Asif Ali  wrote:

> I've used memcached heavily in such scenarios and all such data is always
> in Memory.
>
> Memcached definitely is a great solution for this and scales very well. But
> keep in mind - it is not consistent. Which means there are some requests
> which will be handled incorrectly.
>
> Memcached is great but also look at Guava cache for similar use cases.
>
> Asif Ali
>
>
> On Mon, Aug 20, 2012 at 9:09 AM, Lin Ma  wrote:
>
> > Thank you Drew. I like your reply, especially blocking cache nature
> > provided by HBase. A quick question, for traditional memcached, all of
> the
> > items are in memory, no disk is used, correct?
> >
> > regards,
> > Lin
> >
> > On Mon, Aug 20, 2012 at 9:26 PM, Drew Dahlke 
> > wrote:
> >
> > > I'd say if the memcached model is working for you, stick with it.
> > > HBase (currently) caches whole blocks. With cache blocks enabled you
> > > can achieve 10s of thousands of reqs/sec with a pretty small cluster.
> > > However there's a catch. Once you reach the point where your tables
> > > are so large they can't all sit in memory at the same time you'll see
> > > a behavior change. User traffic tends to be very random access, which,
> > > with block caching, can cause a lot of thrashing with frequent cache
> > > evictions. We've seen this bring our cluster to its knees.
> > >
> > > IMHO a better model is to persist things in HBase and then cache them
> > > with memcached, just as you would with any other data store. If you're
> > > looking for a spiffy memcached replacement I'd recommend checking out
> > > Redis.
> > >
> > >
> > > On Sat, Aug 18, 2012 at 3:12 AM, Lin Ma  wrote:
> > > > Hello guys,
> > > >
> > > > In your experience, is it practical to use HBase directly for
> > > > serving? Say, handling user traffic directly (tens of thousands of
> > > > QPS) behind Apache, and replacing the role of memcached? I am not
> > > > sure whether there are any known pitfalls in replacing memcached
> > > > with HBase. One issue I can think of is that, for a specific row
> > > > range, only one active region server can handle the request, whereas
> > > > with memcached we can set up several instances with duplicated
> > > > content (all of them active) to serve the same purpose under a VIP,
> > > > which can achieve better performance and scalability.
> > > >
> > > > Any advice or reference documents are appreciated. Thanks.
> > > >
> > > > regards,
> > > > Lin
> > >
> >
>


Re: When I use secure hbase client create table, throws accessDeniedException 'user is null'

2012-08-21 Thread Pan,Jinyu
Yes, I've followed the steps, and I still can't create a table, but I can scan tables.

Is there any way to locate the problem?
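
In case it helps narrow the problem down, below is a rough sketch of what a
secured 0.92 client typically has to do before admin calls such as createTable.
This is an assumption based on the security guide referenced below, not a
confirmed fix; the properties would normally live in hbase-site.xml, and the
principal, keytab path, and table name are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureCreateTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Client-side security settings (normally set in hbase-site.xml; see the
    // security guide for the full list, including the server principal properties).
    conf.set("hbase.security.authentication", "kerberos");
    conf.set("hadoop.security.authentication", "kerberos");
    conf.set("hbase.rpc.engine", "org.apache.hadoop.hbase.ipc.SecureRpcEngine");

    // Log in before talking to the cluster; without an authenticated principal
    // the server side may see the caller as user 'null'.
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab(
        "hbaseclient@EXAMPLE.COM",                      // placeholder principal
        "/etc/security/keytabs/hbaseclient.keytab");    // placeholder keytab path

    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("demo_table");   // placeholder table
    desc.addFamily(new HColumnDescriptor("cf"));
    admin.createTable(desc);
  }
}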


-----Original Message-----
From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: August 21, 2012 20:16
To: user@hbase.apache.org
Subject: Re: When I use secure hbase client create table, throws
accessDeniedException 'user is null'

Are you following the steps at http://hbase.apache.org/book/security.html ?

Best Regards,
Sonal
Crux: Reporting for HBase 
Nube Technologies 







On Tue, Aug 21, 2012 at 5:13 PM, Pan,Jinyu  wrote:

> When I access secure HBase and try to create a table, the client throws
> AccessDeniedException 'Insufficient permissions for user 'null' (global,
> action=CREATE)'
>
> Why does this happen, and how can I avoid it?
>
>


Re: When I use secure hbase client create table, throws accessDeniedException 'user is null'

2012-08-21 Thread Sonal Goyal
Are you following the steps at http://hbase.apache.org/book/security.html ?

Best Regards,
Sonal
Crux: Reporting for HBase 
Nube Technologies 







On Tue, Aug 21, 2012 at 5:13 PM, Pan,Jinyu  wrote:

> When I access secure HBase and try to create a table, the client throws
> AccessDeniedException 'Insufficient permissions for user 'null' (global,
> action=CREATE)'
>
> Why does this happen, and how can I avoid it?
>
>


When I use secure hbase client create table, throws accessDeniedException 'user is null'

2012-08-21 Thread Pan,Jinyu
When I access secure HBase and try to create a table, the client throws
AccessDeniedException 'Insufficient permissions for user 'null' (global,
action=CREATE)'

Why does this happen, and how can I avoid it?



Re: Using HBase serving to replace memcached

2012-08-21 Thread J Mohamed Zahoor
Again, if your data is so huge that it is much larger than the available
RAM, you might want to rethink.
There are some configs in HBase that will help you in random-read
scenarios, like Bloom filters.
Also, a large number of client connections is one more issue that might
bite you; connection pooling or asynchbase will help there.
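
A small sketch of both suggestions, assuming the 0.92-era client API: enable a
row-level Bloom filter on the serving column family and reuse table handles
through HTablePool. The table, family, and row key names are placeholders, and
the exact setter and pool methods may differ slightly between versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.regionserver.StoreFile;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomReadSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Create the serving table with a ROW Bloom filter on its column family, so
    // random gets can skip store files that cannot contain the requested row.
    HTableDescriptor desc = new HTableDescriptor("serving_table");   // placeholder table
    HColumnDescriptor cf = new HColumnDescriptor("cf");              // placeholder family
    cf.setBloomFilterType(StoreFile.BloomType.ROW);
    desc.addFamily(cf);
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);

    // Reuse table handles across request threads instead of opening a new
    // HTable per request.
    HTablePool pool = new HTablePool(conf, 50);
    HTableInterface table = pool.getTable("serving_table");
    try {
      Result r = table.get(new Get(Bytes.toBytes("some-row-key")));  // placeholder row key
      System.out.println(r.isEmpty() ? "miss" : "hit");
    } finally {
      pool.putTable(table);   // return the handle to the pool
    }
  }
}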

./Zahoor


On Tue, Aug 21, 2012 at 12:56 AM, Asif Ali  wrote:

> I've used memcached heavily in such scenarios, and all such data is always
> in memory.
>
> Memcached is definitely a great solution for this and scales very well. But
> keep in mind that it is not consistent, which means some requests will be
> handled incorrectly.
>
> Memcached is great but also look at Guava cache for similar use cases.
>
> Asif Ali
>
>
> On Mon, Aug 20, 2012 at 9:09 AM, Lin Ma  wrote:
>
> > Thank you Drew. I like your reply, especially the block cache nature
> > provided by HBase. A quick question: for traditional memcached, all of
> > the items are in memory and no disk is used, correct?
> >
> > regards,
> > Lin
> >
> > On Mon, Aug 20, 2012 at 9:26 PM, Drew Dahlke 
> > wrote:
> >
> > > I'd say if the memcached model is working for you, stick with it.
> > > HBase (currently) caches whole blocks. With cache blocks enabled you
> > > can achieve 10s of thousands of reqs/sec with a pretty small cluster.
> > > However there's a catch. Once you reach the point where your tables
> > > are so large they can't all sit in memory at the same time you'll see
> > > a behavior change. User traffic tends to be very random access, which,
> > > with block caching, can cause a lot of thrashing with frequent cache
> > > evictions. We've seen this bring our cluster to its knees.
> > >
> > > IMHO a better model is to persist things in HBase and then cache them
> > > with memcached, just as you would with any other data store. If you're
> > > looking for a spiffy memcached replacement I'd recommend checking out
> > > Redis.
> > >
> > >
> > > On Sat, Aug 18, 2012 at 3:12 AM, Lin Ma  wrote:
> > > > Hello guys,
> > > >
> > > > In your experience, is it practical to use HBase directly for
> > > > serving? Say, handling user traffic directly (tens of thousands of
> > > > QPS) behind Apache, and replacing the role of memcached? I am not
> > > > sure whether there are any known pitfalls in replacing memcached
> > > > with HBase. One issue I can think of is that, for a specific row
> > > > range, only one active region server can handle the request, whereas
> > > > with memcached we can set up several instances with duplicated
> > > > content (all of them active) to serve the same purpose under a VIP,
> > > > which can achieve better performance and scalability.
> > > >
> > > > Any advice or reference documents are appreciated. Thanks.
> > > >
> > > > regards,
> > > > Lin
> > >
> >
>


Substring comparator for column key

2012-08-21 Thread Shagun Agarwal
Hi,

There is a SubstringComparator which can be used with SingleColumnValueFilter for
substring filtering, but that works on the cell value. Is there any way to do
substring filtering on the column key (qualifier)?

Thanks
Shagun
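
For comparison, one possible way to filter on the column key is a QualifierFilter
combined with the same SubstringComparator. This is only a sketch worth verifying
against your HBase version; the table name, family, and substring are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.QualifierFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;

public class QualifierSubstringScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");   // placeholder table name

    // Keep only cells whose column qualifier contains the given substring.
    Scan scan = new Scan();
    scan.setFilter(new QualifierFilter(
        CompareFilter.CompareOp.EQUAL, new SubstringComparator("order")));  // placeholder substring

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(r);
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}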


Re: region servers failing due to bad datanode

2012-08-21 Thread Rajesh M
Hi,

I work on the same team as the OP and am filling in for him today.

I ran the following command on the hbase master node - bin/hadoop fsck
/hbase -files -blocks | grep blk_-7841650651979512601_775949
There was no output returned. So it looks like the block does not exist. I
verified that this command returned outputs for various other existing
blocks.

I also checked the directory
-  /hbase/.logs/,
60020,1345222869339/%2C60020%2C1345222869339.1345420758726.
That directory contains only HLog files.

Am I missing something here?

Let me know what other information you need to help diagnose the issue.

- Rajesh

On Mon, Aug 20, 2012 at 8:21 PM, Khang Pham  wrote:

> Hi,
>
> Can you go to HDFS and check whether you have the file
> blk_-7841650651979512601_775949, and what its size is?
>
> Its location is probably somewhere in /hbase/.logs/,
> 60020,1345222869339/%2C60020%2C1345222869339.1345420758726.
> blk_-7841650651979512601_775949
>
> -- Khang
> On Mon, Aug 20, 2012 at 9:23 PM, prem yadav  wrote:
>
> > blk_-7841650651979512601_775949
> >
>