Re: Setting TTL at the row level

2017-06-23 Thread yonghu
I did not quite understand what you mean by "row timestamp". As far as I
know, a timestamp is associated with each data version (cell). Will you store
multiple data versions in a single column?



On Thu, Jun 22, 2017 at 4:35 AM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> Why not using the cell level ttl?
>
> Le 2017-06-21 2:35 PM, "Vladimir Rodionov"  a
> écrit :
>
> > Should work
> >
> > On Wed, Jun 21, 2017 at 11:31 AM,  wrote:
> >
> > > Hi all,
> > >
> > > I know it is possible to set TTL in HBase at the column family level -
> > > which makes HBase delete rows in the column family when they reach a
> > > certain age.
> > >
> > > Rather than expire a row after it's reached a certain age, I would like
> > to
> > > expire each specific row at a specific time in the future (I.e. set
> > expiry
> > > at the row level, rather than at the column family level). To achieve
> > this,
> > > I am planning on setting the column family TTL to something very short
> > > (e.g. 1 minute) and then when I write my rows, I will set the row
> > timestamp
> > > to [current datetime + time until I want row to expire]. Since HBase
> uses
> > > row timestamp for TTL, this should let me effectively set TTL on the
> row
> > > level.
> > >
> > > Will this work? Is there any reason not to do this?
> > >
> > > Thanks!
> > > Josh
> > >
> > >
> > >
> >
>
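
For reference, a minimal sketch of both approaches, assuming a 1.x-era client
API and a hypothetical table "t" with column family "cf" ("connection" is an
org.apache.hadoop.hbase.client.Connection opened elsewhere). The per-cell TTL
that Jean-Marc mentions is set through Mutation#setTTL, while the plan in the
original question writes the Put with an explicit future timestamp against a
short family-level TTL:

    Table table = connection.getTable(TableName.valueOf("t"));

    // Option 1: per-cell TTL -- the cell expires ttlMillis after it is
    // written, independent of any column-family TTL.
    Put p1 = new Put(Bytes.toBytes("row1"));
    p1.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
    p1.setTTL(60L * 60 * 1000);          // expire this cell in one hour
    table.put(p1);

    // Option 2: short family TTL plus a future timestamp (the original plan).
    // With the family TTL set to e.g. 1 minute, the cell only becomes
    // eligible for expiry once its (future) timestamp has passed.
    long expireAt = System.currentTimeMillis() + 24L * 60 * 60 * 1000;
    Put p2 = new Put(Bytes.toBytes("row2"));
    p2.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), expireAt, Bytes.toBytes("value"));
    table.put(p2);
    table.close();

One caveat with the future-timestamp approach: cells stamped in the future
will shadow later writes made with current-time timestamps, so it really only
suits write-once data.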


Re: multiple data versions vs. multiple rows?

2015-01-20 Thread yonghu
I think we need to look at different situations.

1. One column gets updated frequently and the others do not. With the row
representation we repeat the unchanged values in every new row, which can
cause a lot of data redundancy. I think this explains why, in my test, the
multiple-data-version approach performed better than the multiple-row
approach.

2. All columns are updated evenly. In that case there is not much difference
in data volume between the two layouts, since each data version is stored as
a key-value pair anyway, so the performance difference between the two
approaches should not be significant.
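
A rough sketch of the difference between the two layouts from the client side
(0.94-era client API, plain scans rather than the MapReduce jobs I used;
table names as in my test). The main point is that t1 needs setMaxVersions to
surface the history, while in t2 every event is an ordinary row:

    Configuration conf = HBaseConfiguration.create();

    // t1: one row per user, up to 10 versions per cell.
    HTable t1 = new HTable(conf, "t1");
    Scan s1 = new Scan();
    s1.setMaxVersions(10);            // otherwise only the newest version is returned
    long cells1 = 0;
    for (Result r : t1.getScanner(s1)) {
        cells1 += r.raw().length;     // each KeyValue repeats the full row key
    }

    // t2: one row per event/version; a plain scan already sees all of them.
    HTable t2 = new HTable(conf, "t2");
    long cells2 = 0;
    for (Result r : t2.getScanner(new Scan())) {
        cells2 += r.raw().length;
    }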

Yong

On Tue, Jan 20, 2015 at 8:16 AM, Serega Sheypak 
wrote:

> does performance should differ significantly if row value size is small and
> we don't have too much versions.
> Assume, that a pack of versions for key is less than recommended HFile
> block (8KB to 1MB
>
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/HFile.html
> ),
> which is minimal "read unit", should we see any difference at all?
> Am I right?
>
>
> 2015-01-20 0:33 GMT+03:00 Jean-Marc Spaggiari :
>
> > Hi Yong,
> >
> > If you want to compare the performances, you need to run way bigger and
> > longer tests. Dont run them in parallete. Run them at least 10 time each
> to
> > make sure you have a good trend. Is the difference between the 2
> > significant? It should not.
> >
> > JM
> >
> > 2015-01-19 15:17 GMT-05:00 yonghu :
> >
> > > Hi,
> > >
> > > Thanks for your suggestion. I have already considered the first issue
> > that
> > > one row  is not allowed to be split between 2 regions.
> > >
> > > However, I have made a small scan-test with MapReduce. I first created
> a
> > > table t1 with 1 million rows and allowed each column to store 10 data
> > > versions. Then, I translated t1 into t2 in which multiple data versions
> > in
> > > t1 were transformed into multiple rows in t2. I wrote two MapReduce
> > > programs to scan t1 and t2 individually. What I got is the table
> scanning
> > > time of t1 is shorter than t2. So, I think for performance reason,
> > multiple
> > > data versions may be a better option than multiple rows.
> > >
> > > But just as you said, which approach to use depends on how many
> > historical
> > > events you want to keep.
> > >
> > > regards!
> > >
> > > Yong
> > >
> > >
> > > On Mon, Jan 19, 2015 at 8:37 PM, Jean-Marc Spaggiari <
> > > jean-m...@spaggiari.org> wrote:
> > >
> > > > Hi Yong,
> > > >
> > > > A row will not split between 2 regions. If you plan having thousands
> of
> > > > versions, based on the size of your data, you might end up having a
> row
> > > > bigger than your preferred region size.
> > > >
> > > > If you plan just keep few versions of the history to have a look at
> > it, I
> > > > will say go with it. If you plan to have one million version because
> > you
> > > > want to keep all the events history, go with the row approach.
> > > >
> > > > You can also consider going with the Column Qualifier approach. This
> > has
> > > > the same constraint as the versions regarding the split in 2 regions,
> > but
> > > > it might me easier to manage and still give you the consistency of
> > being
> > > > within a row.
> > > >
> > > > JM
> > > >
> > > > 2015-01-19 14:28 GMT-05:00 yonghu :
> > > >
> > > > > Dear all,
> > > > >
> > > > > I want to record the user history data. I know there exists two
> > > options,
> > > > > one is to store user events in a single row with multiple data
> > versions
> > > > and
> > > > > the other one is to use multiple rows. I wonder which one is better
> > for
> > > > > performance?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Yong
> > > > >
> > > >
> > >
> >
>


Re: multiple data versions vs. multiple rows?

2015-01-19 Thread yonghu
Hi,

Thanks for your suggestion. I had already considered the first issue, that a
single row cannot be split across 2 regions.

However, I ran a small scan test with MapReduce. I first created a table t1
with 1 million rows and allowed each column to store 10 data versions. Then I
translated t1 into t2, turning the multiple data versions of t1 into multiple
rows in t2. I wrote two MapReduce programs to scan t1 and t2 individually,
and the scanning time for t1 was shorter than for t2. So, for performance
reasons, multiple data versions may be a better option than multiple rows.

But just as you said, which approach to use depends on how many historical
events you want to keep.

regards!

Yong


On Mon, Jan 19, 2015 at 8:37 PM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> Hi Yong,
>
> A row will not split between 2 regions. If you plan having thousands of
> versions, based on the size of your data, you might end up having a row
> bigger than your preferred region size.
>
> If you plan just keep few versions of the history to have a look at it, I
> will say go with it. If you plan to have one million version because you
> want to keep all the events history, go with the row approach.
>
> You can also consider going with the Column Qualifier approach. This has
> the same constraint as the versions regarding the split in 2 regions, but
> it might me easier to manage and still give you the consistency of being
> within a row.
>
> JM
>
> 2015-01-19 14:28 GMT-05:00 yonghu :
>
> > Dear all,
> >
> > I want to record the user history data. I know there exists two options,
> > one is to store user events in a single row with multiple data versions
> and
> > the other one is to use multiple rows. I wonder which one is better for
> > performance?
> >
> > Thanks!
> >
> > Yong
> >
>


multiple data versions vs. multiple rows?

2015-01-19 Thread yonghu
Dear all,

I want to record user history data. I know there are two options: one is to
store user events in a single row with multiple data versions, and the other
is to use multiple rows. I wonder which one is better for performance?

Thanks!

Yong


Strange behavior when using MapReduce to process HBase.

2014-11-25 Thread yonghu
Hi,

I wrote a copyTable MapReduce program. My HBase version is 0.94.16. Several
rows in the source table contain multiple data versions. The map function
looks as follows:

public void map(ImmutableBytesWritable rowKey, Result res, Context context)
    throws IOException, InterruptedException {
  // Emit one Put per KeyValue so that every data version is copied.
  for (KeyValue kv : res.list()) {
    Put put = new Put(rowKey.get());
    // Carry over the source timestamp; without it all versions collapse (see below).
    put.add(kv.getFamily(), kv.getQualifier(), kv.getTimestamp(), kv.getValue());
    context.write(null, put);
  }
}

At first I did not set the timestamp and just used put.add(kv.getFamily(),
kv.getQualifier(), kv.getValue()). However, that approach only keeps the
latest data version, which means the older versions are overwritten, even
though each of them is issued with a separate Put command.

After I added the timestamp to each data version (cell), as in the code shown
above, I got multiple data versions.

The only explanation I can think of is that HBase assigns the same timestamp
to all of these data versions, so the older values are overwritten. What I
cannot understand is that each cell is issued with an individual Put command;
when clients issue Puts explicitly, HBase generates a distinct timestamp for
each cell. That behavior does not seem to apply when the Puts come from
MapReduce.

regards!

Yong


Is it possible to get the table name at the Map phase?

2014-11-21 Thread yonghu
Hello all,

I want to implement a difference operator using MapReduce. I read two tables
with MultiTableInputFormat. In the Map phase, I need to tag each row with the
name of the table it came from, but how can I get the table name?

One way I can think of is to create an HTable instance for each table in the
setup() function and, in the map function, call get to check whether the
input row belongs to table_1 or table_2. However, this approach is slow.

Can someone share some different ideas?

regards!

Yong
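
For reference, a sketch of one possibility that avoids the extra Gets,
assuming the InputSplit handed to each mapper is a TableSplit (which is what
MultiTableInputFormat produces); the table name can then be read once in
setup():

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;

    public class TagTableNameMapper extends TableMapper<ImmutableBytesWritable, Result> {

        private byte[] tableName;

        @Override
        protected void setup(Context context) {
            // MultiTableInputFormat creates one split per table region, and
            // the split knows which table it was read from.
            TableSplit split = (TableSplit) context.getInputSplit();
            tableName = split.getTableName();
        }

        @Override
        public void map(ImmutableBytesWritable rowKey, Result res, Context context)
                throws IOException, InterruptedException {
            // tableName now tells us whether the row came from table_1 or
            // table_2, so the row can be tagged before it is emitted.
        }
    }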


Is it possible to assign a specific key range to a specific node?

2014-11-02 Thread yonghu
Dear All,

Suppose that I have a key range from 1 to 100 and want to store keys 1-50 on
the first node and 51-100 on the second node. How can I do this in HBase?

regards!

Yong
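
For reference: HBase does not let you pin a key range to a particular machine
directly, but you can pre-split the table so that each range becomes its own
region; which server a region lands on is then decided by the master and the
balancer (and can be changed by hand with the shell's move command). A
minimal sketch, assuming the 0.94-era API, a hypothetical table name, and
fixed-width zero-padded keys:

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf"));

    // One split point at "051": keys 001-050 fall into the first region,
    // 051-100 into the second. Region-to-server assignment is done by the
    // balancer, not by the split itself.
    byte[][] splitKeys = new byte[][] { Bytes.toBytes("051") };
    admin.createTable(desc, splitKeys);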


A use case for ttl deletion?

2014-09-26 Thread yonghu
Hello,

Can anyone give me a concrete use case for TTL deletions? I mean, in which
situations should we set the TTL property?

regards!

Yong


Re: hbase memstore size

2014-08-06 Thread yonghu
I did not quite understand your problem. You store your data in HBase, and I
guess you will also read data from it later. Generally, HBase first checks
whether the data exists in the memstore and, if not, reads from disk. If you
set the memstore size to 0, every read goes directly to disk; how heavy would
the I/O cost be then? You can think of the memstore as the buffer manager in
an RDBMS.


On Tue, Aug 5, 2014 at 5:54 AM, Alex Newman  wrote:

> Could you explain a bit more of why you don't want a memstore? I can't see
> why it is harmful. Sorry to be dense.
> On Aug 3, 2014 11:24 AM, "Ozhan Gulen"  wrote:
>
> > Hello,
> > In our hbase cluster memstore flush size is 128 mb. And to insert data to
> > tables, we only use bulk load tool. Since bulk loading bypasses
> memstores,
> > they are never used, so we want to minimize memstore flush size. But
> > memstore flush size is used in many important calculations in hbase such
> > that;
> >
> > region split size = Min (R^2 * “hbase.hregion.memstore.flush.size”,
> > “hbase.hregion.max.filesize”)
> >
> > So setting memstore value smaller or "0" for example,  results in some
> > other problems.
> > What do you suggest us in that case. Setting memstore size to 128 holds
> > some memory for tens of regions in region server and we want to get rid
> of
> > it.
> > Thanks a lot.
> >
> > ozhan
> >
>


Re: HBase appends

2014-07-22 Thread yonghu
Hi,

If an author does not have hundreds of publications, you can write them
directly into one column, so the column will contain multiple data versions.
The default number of versions kept is 3, but you can set it higher.
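
A small sketch of what I mean (0.94-era API; hypothetical "authors" table
with a "pub" family): raise the family's maximum number of versions, write
each publication as another version of one column, and read them back with
setMaxVersions:

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HColumnDescriptor pub = new HColumnDescriptor("pub");
    pub.setMaxVersions(500);                 // keep more than the default 3 versions
    HTableDescriptor desc = new HTableDescriptor("authors");
    desc.addFamily(pub);
    admin.createTable(desc);

    HTable table = new HTable(conf, "authors");
    Put put = new Put(Bytes.toBytes("author1"));
    put.add(Bytes.toBytes("pub"), Bytes.toBytes("books"), Bytes.toBytes("a new book"));
    table.put(put);                          // every put adds another version

    Get get = new Get(Bytes.toBytes("author1"));
    get.setMaxVersions();                    // return all stored versions, not just the latest
    Result r = table.get(get);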


On Tue, Jul 22, 2014 at 4:20 AM, Ishan Chhabra 
wrote:

> Arun,
> You need to represent your data in a format such that you can simply add a
> byte[] to the end of the existing byte[] in a Cell and later read and
> decode it as a list.
>
> One way is to use the encode your data as protobuf. When you append a list
> of values in byte[] form in protobuf to an existing list byte[] and read
> the combined byte[], it is automatically recognized as one single list due
> to the way protobuf encodes lists.
>
> Another way to solve this problem is write a new column for each appended
> list and read all the columns and combine when reading. (I prefer this
> approach since the way Append is implemented internally, it can lead to
> high memstore usage).
>
>
> On Mon, Jul 21, 2014 at 5:43 PM, Arun Allamsetty <
> arun.allamse...@gmail.com>
> wrote:
>
> > Hi,
> >
> > If I have a one-to-many relationship in a SQL database (an author might
> > have written many books), and I want to denormalize it for writing in
> > HBase, I'll have a table with the Author as the row key and a *list* of
> > books as values.
> >
> > Now my question is how do I create a *list* such that I could just append
> > to it using the HBase Java API *Append* instead of doing a
> > read-modify-insert on a Java List object containing all the books.
> >
> > Thanks,
> > Arun
> >
>
>
>
> --
> *Ishan Chhabra *| Rocket Scientist | RocketFuel Inc.
>


Re: Moving older data versions to archive

2014-04-03 Thread yonghu
I think you can define a coprocessor to do this. For example, for every Put
you can keep the versions that you want and move the older versions into
another table or onto HDFS. Then either let HBase delete the stale data or
let the coprocessor do that for you. The problem with this approach is
performance: every Put will trigger the coprocessor once.


On Thu, Apr 3, 2014 at 8:55 PM, Jean-Marc Spaggiari  wrote:

> Hey, that's one of the reasons I have opened HBASE-10115 but never got a
> chance to work on it. Basically, setup a TTL on the column, and with the
> hook, move the cells somewhere else.
>
> With current state, the only thing I see is a MR job which will run daily
> and move the older versions. Like, anything where version > 3 (as an
> example) and then delete it (or expire it with TTL, etc.). If unfortunatly
> don't think there is a "nice" solution to do that today.
>
> JM
>
>
> 2014-04-03 11:33 GMT-04:00 Mike Peterson :
>
> >  I need data versioning but want to keep older data in a separate
> location
> > (to keep the current data file denser). What would be the best way to do
> > that?
>


Re: single node's performance and cluster's performance

2014-04-03 Thread yonghu
I think the right way to understand this is that it will slow down query
processing. You can think of the RS that hits heavy I/O as a hotspot node. It
will not slow down the whole cluster; it will only slow down the applications
that access data on that RS.

On Thu, Apr 3, 2014 at 3:58 AM, Libo Yu  wrote:

> "having one such "slow" RS will make the whole cluster work slower
> (basically, at its speed)."
> http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
>
> From: yu_l...@hotmail.com
> To: user@hbase.apache.org
> Subject: single node's performance and cluster's performance
> Date: Wed, 2 Apr 2014 09:06:35 -0400
>
>
>
>
> Hi all,
>
> I read an  article yesterday about Hbase. It says if a region server
> has a performance hit by heavy IO, that will impact the whole cluster.
> So I want to know how a single node's performance issue will slow
> down all the cluster. Thanks.
>
> Libo
>
>
>


Re: LSM tree, SSTable and fractal tree

2014-02-28 Thread yonghu
HBase uses an LSM tree and SSTable-like store files (HFiles); I am not sure about fractal trees.


On Fri, Feb 28, 2014 at 2:57 PM, Shahab Yunus wrote:

> http://www.slideshare.net/jaxlondon2012/hbase-advanced-lars-george
> http://hortonworks.com/hadoop/hbase/
>
> Regards,
> Shahab
>
>
> On Fri, Feb 28, 2014 at 8:36 AM, Vimal Jain  wrote:
>
> > Hi,
> > Which one of the storage structure does hbase uses ? Is it LSM tree ,
> > SSTable or fractal tree ?
> >
> > --
> > Thanks and Regards,
> > Vimal Jain
> >
>


Re: When should we trigger a major compaction?

2014-02-21 Thread yonghu
Before triggering a major compaction, let's first recall why we need one. A
major compaction will:
1. delete data that is masked by a tombstone;
2. delete data whose TTL has expired;
3. compact several small HFiles into a single larger one.

I didn't quite understand what you mean by "use num of storefiles". Why that
metric?


On Fri, Feb 21, 2014 at 10:50 AM, Ramon Wang  wrote:

> Hi Guys
>
> We disabled the automatically major compaction setting in our HBase
> cluster, so we want to something which can tell us when should we do major
> compaction, we were thinking to use "num of storefiles", but we cannot find
> any APIs or any JMX settings which can be used for it, please share us some
> ideas on how to do it, thanks in advance.
>
> Cheers
> Ramon
>
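
For what it's worth, a rough sketch (assuming the 0.94-era client API) of
reading per-region store-file counts from the cluster status and triggering a
major compaction by hand once a threshold is crossed; the threshold and the
region-by-region compaction are only placeholders:

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    ClusterStatus status = admin.getClusterStatus();
    for (ServerName server : status.getServers()) {
        HServerLoad load = status.getLoad(server);
        for (HServerLoad.RegionLoad region : load.getRegionsLoad().values()) {
            if (region.getStorefiles() > 10) {              // arbitrary threshold
                // queue a major compaction for just this region
                admin.majorCompact(region.getNameAsString());
            }
        }
    }

The same store-file count should also be exposed per region server through
the Hadoop metrics/JMX "storefiles" metric, if polling outside the Java API
is preferred.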


Re: TTL forever

2014-02-18 Thread yonghu
I also calculated the TTL in years, just for fun. :) But as Jean-Marc said,
the default TTL is forever.
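
For the record, the sentinel can be seen directly on a column descriptor
(0.94-era API); Integer.MAX_VALUE seconds is roughly 68 years, which is where
that figure comes from, but HBase treats it as "never expire":

    HColumnDescriptor cf = new HColumnDescriptor("cf");
    // Default TTL is HConstants.FOREVER == Integer.MAX_VALUE (in seconds),
    // which HBase interprets as "no expiry" rather than a real deadline.
    System.out.println(cf.getTimeToLive());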


On Tue, Feb 18, 2014 at 2:05 PM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> Hi Mohamed,
>
> Default value is MAX_VALUE, which is considered as "forever". So default
> TTL is NOT 69 years. default TTL IS forever.
>
> JM
>
>
> 2014-02-18 5:19 GMT-05:00 Mohamed Ghareb :
>
> > Many thanks, I got it
> > The default TTL over 69 years
> >
> > -Original Message-
> > From: lars hofhansl [mailto:la...@apache.org]
> > Sent: Tuesday, February 18, 2014 12:01 AM
> > To: user@hbase.apache.org
> > Subject: Re: TTL forever
> >
> > Just do not set any TTL, the default is "forever".
> >
> >
> >
> > 
> >  From: Mohamed Ghareb 
> > To: "user@hbase.apache.org" 
> > Sent: Monday, February 17, 2014 9:50 AM
> > Subject: TTL forever
> >
> >
> > I know that time to live it delete the cell content after the TTL date
> > setting when major compaction accured
> > If I need make setting TTL to live forever. How I do that
> >
>


Re: What's the Common Way to Execute an HBase Job?

2014-02-11 Thread yonghu
Hi,

To process the data in HBase, you have several options:

1. a Java program using the HBase API;
2. a MapReduce program;
3. high-level languages such as Hive or Pig (built on top of MapReduce);
4. Phoenix, also a high-level language (built on coprocessors).

Which one you should use depends on your requirements.

Yong


On Wed, Feb 12, 2014 at 7:18 AM, Ji ZHANG  wrote:

> Hi,
>
> I'm using the HBase Client API to connect to a remote cluster and do
> some operations. This project will certainly require hbase and
> hadoop-core jars. And my question is whether I should use 'java'
> command and handle all the dependencies (using maven shaded plugin, or
> set the classpath environment), or there's a magic utility command to
> handle all these for me?
>
> Take map-redcue job for an instance. Typically the main class will
> extend Configured and implement Tool. The job will be executed by
> 'hadoop jar' command and all environment and hadoop-core dependency
> are at hand. This approach also handles the common command line
> parsing for me, and I can easily get an instance of Configuration by
> 'this.getConf()';
>
> I'm wondering whether HBase provides the same utiliy command?
>
> Thanks,
> Jerry
>


Re: Newbie question: Rowkey design

2013-12-17 Thread yonghu
In my opinion, it really depends on your queries.

The first design achieves data locality: no additional data is transmitted
between nodes. But this strategy sacrifices parallelism, and the node that
stores A will become a hot node if too many applications try to access A.

The second approach gives you parallelism, but you somehow need to merge the
data to produce the final results. So there is a trade-off between data
locality and parallelism, and query performance will be influenced by the
following factors:

1. data size;
2. data access frequency;
3. data access pattern (full scan or index scan);
4. network bandwidth.

So the best solution for one situation may not fit the others.



On Tue, Dec 17, 2013 at 3:56 AM, Tao Xiao  wrote:

> Sometimes row key design is a trade-off issue between load-balance and
> query : if you design row key such that you can query it very fast and
> convenient, maybe the records are not spread evenly across the nodes; if
> you design row key such that the records are spread evenly across the
> nodes, maybe it's not convenient to query or impossible to get the record
> through row key directly (say you have a random number as the row key's
> prefix).
>
> You can have a look at secondary index. Secondary index is very helpful.
>
>
>
>
> 2013/12/16 Wilm Schumacher 
>
> > Hi,
> >
> > I'm a newbie to hbase and have a question on the rowkey design and I
> > hope this question isn't to newbie-like for this list. I have a question
> > which cannot be answered by knoledge of code but by experience with
> > large databases, thus this mail.
> >
> > For the sake of explaination I create a small example. Suppose you want
> > to design a small "blogging" plattform. You just want to store the name
> > of the user and a small text. And of course you want to get all postings
> > of one user.
> >
> > Furthermore we have 4 users, let's call them A,B,C,D (and you can trust
> > that the length of the username is fixed). Now let's say the A,B,C and D
> > have N postings, and D has 6*N postings. BUT: the data of A is 3 times
> > more often fetched than the data from the other users each!
> >
> > If you create a hbase cluster with 10 nodes, every node is holding N
> > postings (of course I know, that the data is hold redundantly, but this
> > is not so important for the question).
> >
> > Rowkey design #1:
> > the i-th posting of user X would have the rowkey: "$X$i", e.g. "A003".
> > The table just would be: "create 'postings' , 'text'"
> >
> > For this rowkey design the first node would hold the data of A, the
> > second of B, the third of C and the fourth to the tenth node the data of
> D.
> >
> > Fetching of data would be very easy, but half of the traffic would hit
> > the first node.
> >
> > Rowkey design #2
> > the rowkey would be random, e.g. an uuid. The table design would be now:
> > "create 'postings' , 'user' , 'text'"
> >
> > the fetching of the data would be a "real" map-reduce job, checking for
> > the user and emit etc..
> >
> > So, if a fetching takes place I have to do more computation cycles and
> > IO. But in this scenario all traffic would hit all 10 servers.
> >
> > If the number of N (number of postings) is large enough that the disk
> > space is critical, I'm also not able to adjust the key regions in a way
> > that e.g. the data of D is only on the last server and the key space of
> > A would span the first 5 nodes. Or making replication very broad (e.g.
> > 10 times in this case)
> >
> > So basically the question is: What's the better plan? Trying to avoid
> > computation cycles of map reducing and get the key design straight, or
> > trying to scale the computation, but doing more IO?
> >
> > I hope that the small example helped to make the question more vivid.
> >
> > Best wishes
> >
> > Wilm
> >
>


Re: Online/Realtime query with filter and join?

2013-11-29 Thread yonghu
The question is what you mean by "real-time". What is your performance
requirement? In my opinion, MapReduce is not suitable for real-time data
processing.


On Fri, Nov 29, 2013 at 9:55 AM, Azuryy Yu  wrote:

> you can try phoniex.
>  On 2013-11-29 3:44 PM, "Ramon Wang"  wrote:
>
> > Hi Folks
> >
> > It seems to be impossible, but I still want to check if there is a way we
> > can do "complex" query on HBase with "Order By", "JOIN".. etc like we
> have
> > with normal RDBMS, we are asked to provided such a solution for it, any
> > ideas? Thanks for your help.
> >
> > BTW, i think maybe impala from CDH would be a way to go, but haven't got
> > time to check it yet.
> >
> > Thanks
> > Ramon
> >
>


Re: How to create HTableInterface in coprocessor?

2013-10-24 Thread yonghu
OK, I will give it a try.
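
In the meantime, this is roughly the shape of what I will try, following
Gary's suggestion to get the table through the coprocessor environment
instead of creating a connection myself (a minimal sketch, assuming the 0.94
API with the HBASE-9819 backport applied; the "tracking" table and the
written cell follow my earlier example):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TriggerForModification extends BaseRegionObserver {

        @Override
        public void prePut(ObserverContext<RegionCoprocessorEnvironment> e, Put put,
                WALEdit edit, boolean writeToWAL) throws IOException {
            // The environment hands out a table handle; no HConnection is needed.
            HTableInterface tracking = e.getEnvironment().getTable(Bytes.toBytes("tracking"));
            try {
                Put p = new Put(put.getRow());
                p.add(Bytes.toBytes("Value"), Bytes.toBytes("Current"), Bytes.toBytes(1));
                tracking.put(p);
            } finally {
                tracking.close();   // release the table handle
            }
        }
    }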

regards!

Yong


On Wed, Oct 23, 2013 at 11:53 PM, Ted Yu  wrote:

> Yong:
> I have attached the backport to HBASE-9819.
>
> If you can patch your build and see if it fixes the problem, that would be
> great.
>
>
> On Tue, Oct 22, 2013 at 2:58 PM, Ted Yu  wrote:
>
> > Yong:
> > There is unit test exercising CoprocessorEnvironment.getTable().
> > See
> src/test/java/org/apache/hadoop/hbase/coprocessor/TestOpenTableInCoprocessor.java
> > :
> >
> >   HTableInterface table = e.getEnvironment().getTable(otherTable);
> >   Put p = new Put(new byte[] { 'a' });
> >   p.add(family, null, new byte[] { 'a' });
> >   table.put(put);
> >
> > If you can modify the above test to show how the exception is reproduced,
> > that would help us fully understand the case and verify the fix.
> >
> > Thanks
> >
> >
> > On Tue, Oct 22, 2013 at 12:51 PM, Ted Yu  wrote:
> >
> >> I logged HBASE-9819 to backport HBASE-8372 'Provide mutability to
> >> CompoundConfiguration' to 0.94
> >>
> >> If you have time, you can work on the backport.
> >>
> >> Cheers
> >>
> >>
> >> On Tue, Oct 22, 2013 at 11:56 AM, yonghu  wrote:
> >>
> >>> Hi Ted,
> >>>
> >>> This is because I tried different ways to generate a HTableInterface.
> >>>
> >>> One is as Gray mentioned, use RegionCoprocessorEnvironment "rce" to
> >>> create
> >>> a HTableInterface "td", but it did not work. So I commented it.
> >>>
> >>> Later I tried the approach which is suggested by
> >>> http://hbase.apache.org/book.html#client.connections. First create a
> >>> HConnection "hc", and then create a  HTableInterface "td". It still did
> >>> not
> >>> work.
> >>>
> >>> Both of them return the same error messages. Such as:
> >>>
> >>> ERROR: org.apache.hadoop.hbase.
> >>> client.RetriesExhaustedWithDetailsException: Failed 1 action:
> >>> org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor:
> >>>
> >>>
> 'org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionEnvironment@9a99eb
> >>> '
> >>> threw: 'java.lang.UnsupportedOperationException: Immutable
> Configuration'
> >>> and has been removedfrom the active coprocessor set.
> >>> at
> >>>
> >>>
> org.apache.hadoop.hbase.coprocessor.CoprocessorHost.handleCoprocessorThrowable(CoprocessorHost.java:740)
> >>> at
> >>>
> >>>
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.prePut(RegionCoprocessorHost.java:810)
> >>> .
> >>>
> >>> and indicate the errors caused by rce.getTable() or hc.getTable().
> >>>
> >>> regards!
> >>>
> >>> Yong
> >>>
> >>>
> >>> On Tue, Oct 22, 2013 at 8:42 PM, Ted Yu  wrote:
> >>>
> >>> > There're two types of exceptions. In the code below, I saw
> >>> rce.getTable()
> >>> > being commented out.
> >>> >
> >>> > Can you tell us the correlation between types of exception and
> >>> getTable()
> >>> > calls ?
> >>> >
> >>> > Thanks
> >>> >
> >>> >
> >>> > On Tue, Oct 22, 2013 at 11:24 AM, yonghu 
> >>> wrote:
> >>> >
> >>> > > public void prePut(ObserverContext e,
> >>> Put
> >>> > > put, WALEdit edit, boolean writeToWAL){
> >>> > > RegionCoprocessorEnvironment rce = e.getEnvironment();
> >>> > > HTableInterface td = null;
> >>> > > HTableDescriptor htd = hr.getTableDesc();
> >>> > > Configuration conf = rce.getConfiguration();
> >>> > > HConnection hc = null;
> >>> > > try {
> >>> > > hc = HConnectionManager.createConnection(conf);
> >>> > > } catch (ZooKeeperConnectionException e1) {
> >>> > > // TODO Auto-generated catch block
> >>> > > e1.printStackTrace();
> >>> > > }
> >>> > > try {
> >>> > > td = hc.getTable(Bytes.t

Re: How to create HTableInterface in coprocessor?

2013-10-22 Thread yonghu
Hi Ted,

This is because I tried different ways to obtain an HTableInterface.

One is the way Gary mentioned: use the RegionCoprocessorEnvironment "rce" to
create an HTableInterface "td". It did not work, so I commented it out.

Later I tried the approach suggested by
http://hbase.apache.org/book.html#client.connections: first create an
HConnection "hc" and then create an HTableInterface "td" from it. It still
did not work.

Both of them return the same error message, such as:

ERROR: org.apache.hadoop.hbase.
client.RetriesExhaustedWithDetailsException: Failed 1 action:
org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor:
'org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionEnvironment@9a99eb'
threw: 'java.lang.UnsupportedOperationException: Immutable Configuration'
and has been removedfrom the active coprocessor set.
at
org.apache.hadoop.hbase.coprocessor.CoprocessorHost.handleCoprocessorThrowable(CoprocessorHost.java:740)
at
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.prePut(RegionCoprocessorHost.java:810)
.

and they point to the errors being caused by rce.getTable() or hc.getTable().

regards!

Yong


On Tue, Oct 22, 2013 at 8:42 PM, Ted Yu  wrote:

> There're two types of exceptions. In the code below, I saw rce.getTable()
> being commented out.
>
> Can you tell us the correlation between types of exception and getTable()
> calls ?
>
> Thanks
>
>
> On Tue, Oct 22, 2013 at 11:24 AM, yonghu  wrote:
>
> > public void prePut(ObserverContext e, Put
> > put, WALEdit edit, boolean writeToWAL){
> > RegionCoprocessorEnvironment rce = e.getEnvironment();
> > HTableInterface td = null;
> > HTableDescriptor htd = hr.getTableDesc();
> > Configuration conf = rce.getConfiguration();
> > HConnection hc = null;
> > try {
> > hc = HConnectionManager.createConnection(conf);
> > } catch (ZooKeeperConnectionException e1) {
> > // TODO Auto-generated catch block
> > e1.printStackTrace();
> > }
> > try {
> > td = hc.getTable(Bytes.toBytes(tracking));
> > } catch (IOException e1) {
> > // TODO Auto-generated catch block
> > e1.printStackTrace();
> > }
> > try {
> > //td = rce.getTable(Bytes.toBytes(tracking));
> > Put p = new Put(put.getRow());
> > p.add(Bytes.toBytes("Value"), Bytes.toBytes("Current"),
> > Bytes.toBytes(1));
> > td.put(p);
> > } catch (IOException e2) {
> > // TODO Auto-generated catch block
> > e2.printStackTrace();
> > }
> > }
> >
> >
> > On Tue, Oct 22, 2013 at 8:20 PM, Ted Yu  wrote:
> >
> > > Can you show us your code around the following line ?
> > >
> CDCTrigger.TriggerForModification.prePut(TriggerForModification.java:51)
> > >
> > > The error was due to:
> > >
> > > public HTableInterface getTable(byte[] tableName, ExecutorService
> > pool)
> > > throws IOException {
> > >   if (managed) {
> > > throw new IOException("The connection has to be unmanaged.");
> > >   }
> > >
> > > Cheers
> > >
> > >
> > > On Tue, Oct 22, 2013 at 11:14 AM, yonghu 
> wrote:
> > >
> > > > Ted,
> > > >
> > > > Can you tell me how to dump the stack trace of HBase? By the way, I
> > check
> > > > the log of RegionServer. It has following error messages:
> > > >
> > > > java.io.IOException: The connection has to be unmanaged.
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:669)
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:658)
> > > > at
> > > >
> > CDCTrigger.TriggerForModification.prePut(TriggerForModification.java:51)
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.prePut(RegionCoprocessorHost.java:808)
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegion.doPreMutationHook(HRegion.java:2196)
> > > &

Re: How to create HTableInterface in coprocessor?

2013-10-22 Thread yonghu
public void prePut(ObserverContext<RegionCoprocessorEnvironment> e, Put put,
        WALEdit edit, boolean writeToWAL) {
    RegionCoprocessorEnvironment rce = e.getEnvironment();
    HTableInterface td = null;
    HTableDescriptor htd = hr.getTableDesc();
    Configuration conf = rce.getConfiguration();
    HConnection hc = null;
    try {
        hc = HConnectionManager.createConnection(conf);
    } catch (ZooKeeperConnectionException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    try {
        td = hc.getTable(Bytes.toBytes(tracking));
    } catch (IOException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    try {
        //td = rce.getTable(Bytes.toBytes(tracking));
        Put p = new Put(put.getRow());
        p.add(Bytes.toBytes("Value"), Bytes.toBytes("Current"),
                Bytes.toBytes(1));
        td.put(p);
    } catch (IOException e2) {
        // TODO Auto-generated catch block
        e2.printStackTrace();
    }
}


On Tue, Oct 22, 2013 at 8:20 PM, Ted Yu  wrote:

> Can you show us your code around the following line ?
> CDCTrigger.TriggerForModification.prePut(TriggerForModification.java:51)
>
> The error was due to:
>
> public HTableInterface getTable(byte[] tableName, ExecutorService pool)
> throws IOException {
>   if (managed) {
> throw new IOException("The connection has to be unmanaged.");
>   }
>
> Cheers
>
>
> On Tue, Oct 22, 2013 at 11:14 AM, yonghu  wrote:
>
> > Ted,
> >
> > Can you tell me how to dump the stack trace of HBase? By the way, I check
> > the log of RegionServer. It has following error messages:
> >
> > java.io.IOException: The connection has to be unmanaged.
> > at
> >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:669)
> > at
> >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:658)
> > at
> > CDCTrigger.TriggerForModification.prePut(TriggerForModification.java:51)
> > at
> >
> >
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.prePut(RegionCoprocessorHost.java:808)
> > at
> >
> >
> org.apache.hadoop.hbase.regionserver.HRegion.doPreMutationHook(HRegion.java:2196)
> > at
> >
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2172)
> > at
> >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3811)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:597)
> > at
> >
> >
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> > at
> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
> >
> >
> > On Tue, Oct 22, 2013 at 8:07 PM, Ted Yu  wrote:
> >
> > > Yong:
> > > Can you post full stack trace so that we can diagnose the problem ?
> > >
> > > Cheers
> > >
> > >
> > > On Tue, Oct 22, 2013 at 11:01 AM, yonghu 
> wrote:
> > >
> > > > Gray,
> > > >
> > > > Finally, I saw the error messages. ERROR:
> > > > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
> > > Failed
> > > > 1 action: org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor:
> > > >
> > > >
> > >
> >
> 'org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionEnvironment@303a60
> > > > '
> > > > threw: 'java.lang.UnsupportedOperationException: Immutable
> > Configuration'
> > > > and has been removed from the active coprocessor set.
> > > >
> > > > I will try different approach as Ted mentioned.
> > > >
> > > >
> > > > On Tue, Oct 22, 2013 at 7:49 PM, yonghu 
> wrote:
> > > >
> > > > > Gray
> > > > >
> > > > > Thanks for your response. I tried your approach. But it did not
> work.
> > > The
> > > > > HBase just stalled, no messages, nothing happened. By the way, my
> > hbase
> > > > > version is 0.94.12.
&

Re: How to create HTableInterface in coprocessor?

2013-10-22 Thread yonghu
Ted,

This is the stack trace that I got:

ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
Failed 1 action: org.apache.hadoop.hbase.DoNotRetryIOException:
Coprocessor:
'org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionEnvironment@9a99eb'
threw: 'java.lang.UnsupportedOperationException: Immutable Configuration'
and has been removedfrom the active coprocessor set.
at
org.apache.hadoop.hbase.coprocessor.CoprocessorHost.handleCoprocessorThrowable(CoprocessorHost.java:740)
at
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.prePut(RegionCoprocessorHost.java:810)
at
org.apache.hadoop.hbase.regionserver.HRegion.doPreMutationHook(HRegion.java:2196)
at
org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2172)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3811)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
Caused by: java.lang.UnsupportedOperationException: Immutable Configuration
at
org.apache.hadoop.hbase.regionserver.CompoundConfiguration.set(CompoundConfiguration.java:484)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.ensureZookeeperTrackers(HConnectionManager.java:721)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:986)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:961)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:251)
at org.apache.hadoop.hbase.client.HTable.(HTable.java:243)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:671)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:658)
at
CDCTrigger.TriggerForModification.prePut(TriggerForModification.java:61)
at
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.prePut(RegionCoprocessorHost.java:808)
... 9 more
: 1 time, servers with issues: hans-laptop:60020,


On Tue, Oct 22, 2013 at 8:14 PM, yonghu  wrote:

> Ted,
>
> Can you tell me how to dump the stack trace of HBase? By the way, I check
> the log of RegionServer. It has following error messages:
>
> java.io.IOException: The connection has to be unmanaged.
> at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:669)
> at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:658)
> at
> CDCTrigger.TriggerForModification.prePut(TriggerForModification.java:51)
> at
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.prePut(RegionCoprocessorHost.java:808)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.doPreMutationHook(HRegion.java:2196)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2172)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3811)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
>
>
> On Tue, Oct 22, 2013 at 8:07 PM, Ted Yu  wrote:
>
>> Yong:
>> Can you post full stack trace so that we can diagnose the problem ?
>>
>> Cheers
>>
>>
>> On Tue, Oct 22, 2013 at 11:01 AM, yonghu  wrote:
>>
>> > Gray,
>> >
>> > Finally, I saw the error messages. ERROR:
>> > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
>> Failed
>> > 1 action: org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor:
>> >
>> >
>> 'org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionEnvironment@303a60
>> > '
>> > threw: 'java.lang.UnsupportedOperationException: Immutable
>> Configuration'
>> > and has been removed from the active coprocessor 

Re: How to create HTableInterface in coprocessor?

2013-10-22 Thread yonghu
Ted,

Can you tell me how to dump the HBase stack trace? By the way, I checked the
RegionServer log. It has the following error messages:

java.io.IOException: The connection has to be unmanaged.
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:669)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:658)
at
CDCTrigger.TriggerForModification.prePut(TriggerForModification.java:51)
at
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.prePut(RegionCoprocessorHost.java:808)
at
org.apache.hadoop.hbase.regionserver.HRegion.doPreMutationHook(HRegion.java:2196)
at
org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2172)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3811)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)


On Tue, Oct 22, 2013 at 8:07 PM, Ted Yu  wrote:

> Yong:
> Can you post full stack trace so that we can diagnose the problem ?
>
> Cheers
>
>
> On Tue, Oct 22, 2013 at 11:01 AM, yonghu  wrote:
>
> > Gray,
> >
> > Finally, I saw the error messages. ERROR:
> > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
> Failed
> > 1 action: org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor:
> >
> >
> 'org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionEnvironment@303a60
> > '
> > threw: 'java.lang.UnsupportedOperationException: Immutable Configuration'
> > and has been removed from the active coprocessor set.
> >
> > I will try different approach as Ted mentioned.
> >
> >
> > On Tue, Oct 22, 2013 at 7:49 PM, yonghu  wrote:
> >
> > > Gray
> > >
> > > Thanks for your response. I tried your approach. But it did not work.
> The
> > > HBase just stalled, no messages, nothing happened. By the way, my hbase
> > > version is 0.94.12.
> > >
> > >
> > >
> > > On Tue, Oct 22, 2013 at 7:34 PM, Gary Helmling  > >wrote:
> > >
> > >> Within a coprocessor, you can just use the CoprocessorEnvironment
> > instance
> > >> passed to start() method or any of the pre/post hooks, and call
> > >> CoprocessorEnvironment.getTable(byte[] tablename).
> > >>
> > >>
> > >> On Tue, Oct 22, 2013 at 9:41 AM, Ted Yu  wrote:
> > >>
> > >> > Take a look at http://hbase.apache.org/book.html#client.connections,
> > >> > especially 9.3.1.1.
> > >> >
> > >> >
> > >> > On Tue, Oct 22, 2013 at 9:37 AM, yonghu 
> > wrote:
> > >> >
> > >> > > Hello,
> > >> > >
> > >> > > In the oldest verison of HBase , I can get the HTableInterface by
> > >> > > HTablePool.getTable() method. However, in the latest Hbase
> > >> > version0.94.12,
> > >> > > HTablePool is deprecated. So, I tried to use HConnectionManager to
> > >> create
> > >> > > HTableInterface, but it does not work. Can anyone tell me how to
> > >> create
> > >> > > HTableInterface in new HBase version? By the way, there is no
> error
> > >> > message
> > >> > > when I run coprocessor.
> > >> > >
> > >> > > regards!
> > >> > >
> > >> > > Yong
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>


Re: How to create HTableInterface in coprocessor?

2013-10-22 Thread yonghu
Gary,

Finally, I saw the error messages. ERROR:
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
1 action: org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor:
'org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionEnvironment@303a60'
threw: 'java.lang.UnsupportedOperationException: Immutable Configuration'
and has been removed from the active coprocessor set.

I will try a different approach, as Ted mentioned.


On Tue, Oct 22, 2013 at 7:49 PM, yonghu  wrote:

> Gray
>
> Thanks for your response. I tried your approach. But it did not work. The
> HBase just stalled, no messages, nothing happened. By the way, my hbase
> version is 0.94.12.
>
>
>
> On Tue, Oct 22, 2013 at 7:34 PM, Gary Helmling wrote:
>
>> Within a coprocessor, you can just use the CoprocessorEnvironment instance
>> passed to start() method or any of the pre/post hooks, and call
>> CoprocessorEnvironment.getTable(byte[] tablename).
>>
>>
>> On Tue, Oct 22, 2013 at 9:41 AM, Ted Yu  wrote:
>>
>> > Take a look at http://hbase.apache.org/book.html#client.connections ,
>> > especially 9.3.1.1.
>> >
>> >
>> > On Tue, Oct 22, 2013 at 9:37 AM, yonghu  wrote:
>> >
>> > > Hello,
>> > >
>> > > In the oldest verison of HBase , I can get the HTableInterface by
>> > > HTablePool.getTable() method. However, in the latest Hbase
>> > version0.94.12,
>> > > HTablePool is deprecated. So, I tried to use HConnectionManager to
>> create
>> > > HTableInterface, but it does not work. Can anyone tell me how to
>> create
>> > > HTableInterface in new HBase version? By the way, there is no error
>> > message
>> > > when I run coprocessor.
>> > >
>> > > regards!
>> > >
>> > > Yong
>> > >
>> >
>>
>
>


Re: How to create HTableInterface in coprocessor?

2013-10-22 Thread yonghu
Gary,

Thanks for your response. I tried your approach, but it did not work. HBase
just stalled: no messages, nothing happened. By the way, my HBase version is
0.94.12.



On Tue, Oct 22, 2013 at 7:34 PM, Gary Helmling  wrote:

> Within a coprocessor, you can just use the CoprocessorEnvironment instance
> passed to start() method or any of the pre/post hooks, and call
> CoprocessorEnvironment.getTable(byte[] tablename).
>
>
> On Tue, Oct 22, 2013 at 9:41 AM, Ted Yu  wrote:
>
> > Take a look at http://hbase.apache.org/book.html#client.connections ,
> > especially 9.3.1.1.
> >
> >
> > On Tue, Oct 22, 2013 at 9:37 AM, yonghu  wrote:
> >
> > > Hello,
> > >
> > > In the oldest verison of HBase , I can get the HTableInterface by
> > > HTablePool.getTable() method. However, in the latest Hbase
> > version0.94.12,
> > > HTablePool is deprecated. So, I tried to use HConnectionManager to
> create
> > > HTableInterface, but it does not work. Can anyone tell me how to create
> > > HTableInterface in new HBase version? By the way, there is no error
> > message
> > > when I run coprocessor.
> > >
> > > regards!
> > >
> > > Yong
> > >
> >
>


How to create HTableInterface in coprocessor?

2013-10-22 Thread yonghu
Hello,

In older versions of HBase, I could get an HTableInterface with the
HTablePool.getTable() method. However, in the latest HBase version, 0.94.12,
HTablePool is deprecated. So I tried to use HConnectionManager to create an
HTableInterface, but it does not work. Can anyone tell me how to create an
HTableInterface in the new HBase version? By the way, there is no error
message when I run the coprocessor.

regards!

Yong


When will a log file be transferred from the /.log folder to the /.oldlogs folder?

2013-10-17 Thread yonghu
Hello,

I saw some descriptions saying that once data modifications (like Put or
Delete) have been persisted to disk, the log file will be transferred from
the /.log folder to the /.oldlogs folder. However, I made a simple test: I
first created a table, inserted some data, and then used the flush command to
flush all the data from memory to disk. But after that, I did not see any log
files in /.oldlogs. So I wonder when HBase transfers log files from the /.log
folder to the /.oldlogs folder. Is there any parameter I can set? By the way,
I did not find any related information in hbase-default.xml.

regards!

Yong


Re: How to understand the TS of each data version?

2013-09-28 Thread yonghu
Hi Ted,

Thanks for your response. This is also the way I use to avoid the problem.

regards!

Yong


On Sat, Sep 28, 2013 at 4:31 PM, Ted Yu  wrote:

> Can you make NetworkSpeed as column family ?
>
> This way you can treat individual suppliers as columns within the column
> family.
> So for "user Tom has a new supplier d instead of supplier c and its speed
> is 15K":
>
> rk   NetworkSpeed
>   cd
> Tom   {10K:1}
> Tom {15K:2}
>
> In the example above, the numbers after colon are TS. If the speed is
> unknown, you can store a special marker in the Cell.
> I used two rows, but as you said, the two Cells can be written using one
> RPC call.
>
> This way, NetworkSupplier column is not needed.
>
> Cheers
>
>
> On Fri, Sep 27, 2013 at 3:04 PM, yonghu  wrote:
>
> > To Ted,
> >
> > --"Can you tell me why readings corresponding to different timestamps
> would
> > appear in the same row ?"
> >
> > Is that mean the data versions which belong to the same row should at
> least
> > have the same timestamps?
> >
> > For adding a row into HBase, I can use single Put instance, for example,
> > Put put = new Put("tom") and put.addColumn("Network:Supplier","c" ),
> > put.addColmn("Network:Supplier","d"). And hence the data versions will
> have
> > the same TS.
> >
> > However, I can also use multiple Put instances, each Put instance for
> > single data version. For example, Put put1 = new Put1("tom"),
> > put1.addaddColumn("Network:Supplier","c" ). Put put2 = new Put2("tom"),
> > put2.addaddColumn("Network:Supplier","d" ). In this situation, each data
> > version which belongs to the same row will have different TSs even if
> > logically they should have the same TSs. This situation can happen when I
> > first know the name of network supplier and later get the speed of
> > supplier.
> >
> > To lars,
> >
> > --"You have a single row with two columns?"
> >
> > This is just an example for discussion. I had a heavy discussion with the
> > other person about how to understand the right data representation and
> the
> > semantics of TS in HBase. Your explanation is one possible scenario which
> > means "user Tom has a new supplier d instead of supplier c and its speed
> is
> > 15K".
> > However, it is possible that "user Tom has both suppliers c and d and 15K
> > may belong to supplier c, as the speed of supplier d is not tested yet."
> > The second understanding is very tricky and if it happened, we need to
> > redesign the schema of database.
> >
> > So, I wonder
> > 1. If there are any predefined semantics of TS in HBase or the semantics
> of
> > TS is application-specific?
> > 2. Can anyone give any rules of how to assign TS for data versions which
> > belong to the same row?
> >
> > regards!
> >
> > Yong
> >
> >
> >
> >
> >
> > On Fri, Sep 27, 2013 at 7:02 PM, lars hofhansl  wrote:
> >
> > > Not sure I follow.
> > > You have a single row with two columns?
> > > In your scenario you'd see that supplier c has 15k iff you query the
> > > latest data, which seems to be what you want.
> > > Note that you could also query as of TS 4 (c:20k), TS3 (d:20k), TS2
> > (d:10k)
> > >
> > >
> > > -- Lars
> > >
> > >
> > >
> > > 
> > >  From: yonghu 
> > > To: user@hbase.apache.org
> > > Sent: Friday, September 27, 2013 7:24 AM
> > > Subject: How to understand the TS of each data version?
> > >
> > >
> > > Hello,
> > >
> > > In my understanding, the timestamp of each data version is generated by
> > Put
> > > command. The value of TS is either indicated by user or assigned by
> HBase
> > > itself. If the TS is generated by HBase, it only records when (the time
> > > point) that data version is generated (Have no meaning to the
> > application).
> > > However, if TS is indicated by user, it may have a specific meaning to
> > > applications. The reason why I want to ask this question is: How can I
> > > correctly understand the meaning of following data? Suppose I have a
> > table
> > > which is used to record the internet speed of different suppliers for
> > > specific users.
> > > For example,
> > >
> > > rk   Network:Supplier   Network:speed
> > >
> > > Tom   {d:1, c:4} {10K:1, 20K:3, 15K:5}
> > >
> > > Then I can have following different data information representations:
> > >
> > > 1. Supplier d have speeds 10K and 20K. Supplier c have 15K.
> > > 2. Supplier d have speeds 10K, 20K and 15K. We only insert the
> supplier c
> > > but has not inserted any speed information.
> > >
> > > which one is the right understanding? Anyone knows whether there are
> any
> > > predefined semantics of TS in HBase?
> > >
> > > regards!
> > >
> > > Yong
> > >
> >
>


Re: How to understand the TS of each data version?

2013-09-27 Thread yonghu
To Ted,

--"Can you tell me why readings corresponding to different timestamps would
appear in the same row ?"

Does that mean the data versions that belong to the same row should at least
have the same timestamp?

To add a row to HBase, I can use a single Put instance, for example Put put =
new Put("tom") with put.addColumn("Network:Supplier", "c") and
put.addColumn("Network:Supplier", "d"); the data versions will then share the
same TS.

However, I can also use multiple Put instances, one per data version. For
example, Put put1 = new Put("tom") with put1.addColumn("Network:Supplier",
"c"), and Put put2 = new Put("tom") with put2.addColumn("Network:Supplier",
"d"). In this situation, data versions belonging to the same row will have
different TSs even though logically they should have the same TS. This can
happen when I first learn the name of the network supplier and only later get
the supplier's speed.
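
A small sketch of the two write patterns (0.94-era API; "table" is an HTable
opened elsewhere, "Network" is the column family with "Supplier"/"speed" as
qualifiers, and the values are placeholders):

    long ts = System.currentTimeMillis();

    // Pattern 1: a single Put carries both cells, so they share one timestamp.
    Put put = new Put(Bytes.toBytes("Tom"));
    put.add(Bytes.toBytes("Network"), Bytes.toBytes("Supplier"), ts, Bytes.toBytes("c"));
    put.add(Bytes.toBytes("Network"), Bytes.toBytes("speed"), ts, Bytes.toBytes("15K"));
    table.put(put);

    // Pattern 2: separate Puts without explicit timestamps; each cell gets its
    // own server-assigned timestamp, even though the cells logically belong together.
    Put put1 = new Put(Bytes.toBytes("Tom"));
    put1.add(Bytes.toBytes("Network"), Bytes.toBytes("Supplier"), Bytes.toBytes("d"));
    table.put(put1);

    Put put2 = new Put(Bytes.toBytes("Tom"));
    put2.add(Bytes.toBytes("Network"), Bytes.toBytes("speed"), Bytes.toBytes("20K"));
    table.put(put2);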

To lars,

--"You have a single row with two columns?"

This is just an example for discussion. I had a long discussion with another
person about the right data representation and the semantics of the TS in
HBase. Your explanation is one possible scenario, which means "user Tom has a
new supplier d instead of supplier c and its speed is 15K".
However, it is also possible that "user Tom has both suppliers c and d, and
15K may belong to supplier c, as the speed of supplier d has not been tested
yet." The second reading is very tricky, and if that is the intended meaning,
we need to redesign the database schema.

So, I wonder:
1. Are there any predefined semantics for the TS in HBase, or are the
semantics application-specific?
2. Can anyone give any rules for how to assign TSs to data versions that
belong to the same row?

regards!

Yong





On Fri, Sep 27, 2013 at 7:02 PM, lars hofhansl  wrote:

> Not sure I follow.
> You have a single row with two columns?
> In your scenario you'd see that supplier c has 15k iff you query the
> latest data, which seems to be what you want.
> Note that you could also query as of TS 4 (c:20k), TS3 (d:20k), TS2 (d:10k)
>
>
> -- Lars
>
>
>
> 
>  From: yonghu 
> To: user@hbase.apache.org
> Sent: Friday, September 27, 2013 7:24 AM
> Subject: How to understand the TS of each data version?
>
>
> Hello,
>
> In my understanding, the timestamp of each data version is generated by Put
> command. The value of TS is either indicated by user or assigned by HBase
> itself. If the TS is generated by HBase, it only records when (the time
> point) that data version is generated (Have no meaning to the application).
> However, if TS is indicated by user, it may have a specific meaning to
> applications. The reason why I want to ask this question is: How can I
> correctly understand the meaning of following data? Suppose I have a table
> which is used to record the internet speed of different suppliers for
> specific users.
> For example,
>
> rk   Network:Supplier   Network:speed
>
> Tom   {d:1, c:4} {10K:1, 20K:3, 15K:5}
>
> Then I can have following different data information representations:
>
> 1. Supplier d have speeds 10K and 20K. Supplier c have 15K.
> 2. Supplier d have speeds 10K, 20K and 15K. We only insert the supplier c
> but has not inserted any speed information.
>
> which one is the right understanding? Anyone knows whether there are any
> predefined semantics of TS in HBase?
>
> regards!
>
> Yong
>


Re: How to understand the TS of each data version?

2013-09-27 Thread yonghu
(1, 3, 5) are timestamps.

regards!

Yong


On Fri, Sep 27, 2013 at 4:47 PM, Ted Yu  wrote:

> In {10K:1, 20K:3, 15K:5}, what does the value (1, 3, 5) represent ?
>
> Cheers
>
>
> On Fri, Sep 27, 2013 at 7:24 AM, yonghu  wrote:
>
> > Hello,
> >
> > In my understanding, the timestamp of each data version is generated by
> Put
> > command. The value of TS is either indicated by user or assigned by HBase
> > itself. If the TS is generated by HBase, it only records when (the time
> > point) that data version is generated (Have no meaning to the
> application).
> > However, if TS is indicated by user, it may have a specific meaning to
> > applications. The reason why I want to ask this question is: How can I
> > correctly understand the meaning of following data? Suppose I have a
> table
> > which is used to record the internet speed of different suppliers for
> > specific users.
> > For example,
> >
> > rk   Network:Supplier   Network:speed
> >
> > Tom   {d:1, c:4} {10K:1, 20K:3, 15K:5}
> >
> > Then I can have following different data information representations:
> >
> > 1. Supplier d have speeds 10K and 20K. Supplier c have 15K.
> > 2. Supplier d have speeds 10K, 20K and 15K. We only insert the supplier c
> > but has not inserted any speed information.
> >
> > which one is the right understanding? Anyone knows whether there are any
> > predefined semantics of TS in HBase?
> >
> > regards!
> >
> > Yong
> >
>


How to understand the TS of each data version?

2013-09-27 Thread yonghu
Hello,

In my understanding, the timestamp of each data version is generated by the Put
command. The value of the TS is either specified by the user or assigned by HBase
itself. If the TS is generated by HBase, it only records the time point at which
that data version was created (and has no meaning to the application).
However, if the TS is specified by the user, it may have a specific meaning to
the application. The reason why I ask this question is: how can I
correctly understand the meaning of the following data? Suppose I have a table
which is used to record the internet speed of different suppliers for
specific users.
For example,

rk    Network:Supplier   Network:speed

Tom   {d:1, c:4}         {10K:1, 20K:3, 15K:5}

Then I can read this data in two different ways:

1. Supplier d has speeds 10K and 20K. Supplier c has 15K.
2. Supplier d has speeds 10K, 20K and 15K. We have only inserted supplier c
but have not inserted any speed information for it yet.

Which one is the right understanding? Does anyone know whether there are any
predefined semantics of the TS in HBase?

regards!

Yong


Re: hbase schema design

2013-09-18 Thread yonghu
Unlike an RDBMS, HBase stores its data as key-value pairs in HDFS. Hence, the
row key is repeated for every data version (cell).
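
To make this concrete, a small hypothetical sketch (the long row key and the
column names are made up): the row key is physically part of every KeyValue,
so a long row key is paid for once per cell, not once per row:

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueSizeExample {
  public static void main(String[] args) {
    // one cell: the full row key is embedded in the KeyValue that gets stored
    byte[] row = Bytes.toBytes("a-very-long-event-description-used-as-row-key");
    KeyValue kv = new KeyValue(row,
        Bytes.toBytes("d"),              // column family
        Bytes.toBytes("20130917"),       // qualifier (one per day in that design)
        System.currentTimeMillis(),      // timestamp
        Bytes.toBytes(42L));             // the actual payload: just a counter
    // the key part contains row + family + qualifier + timestamp + type,
    // so a long row key dominates the stored size of every single cell
    System.out.println("key bytes = " + kv.getKeyLength()
        + ", value bytes = " + kv.getValueLength());
  }
}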


On Tue, Sep 17, 2013 at 7:53 PM, Ted Yu  wrote:

> w.r.t. Data Block Encoding, you can find some performance numbers here:
>
>
> https://issues.apache.org/jira/browse/HBASE-4218?focusedCommentId=13123337&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13123337
>
>
> On Tue, Sep 17, 2013 at 10:49 AM, Adrian CAPDEFIER
> wrote:
>
> > Thank you for confirming the rowkey is written for every cell value (I
> was
> > referring to 6.3.2 indeed). I have looked into data block encoding, but
> I'm
> > not sure that would help me (more so if I need to link this table to a
> > separate table later on).
> >
> > I will look into the surrogate value option.
> >
> >
> >
> >
> > On Tue, Sep 17, 2013 at 5:53 PM, Ted Yu  wrote:
> >
> > > I guess you were referring to section 6.3.2
> > >
> > > bq. rowkey is stored and/ or read for every cell value
> > >
> > > The above is true.
> > >
> > > bq. the event description is a string of 0.1 to 2Kb
> > >
> > > You can enable Data Block encoding to reduce storage.
> > >
> > > Cheers
> > >
> > >
> > >
> > > On Tue, Sep 17, 2013 at 9:44 AM, Adrian CAPDEFIER <
> > chivas314...@gmail.com
> > > >wrote:
> > >
> > > > Howdy all,
> > > >
> > > > I'm trying to use hbase for the first time (plenty of other
> experience
> > > with
> > > > RDBMS database though), and I have a couple of questions after
> reading
> > > The
> > > > Book.
> > > >
> > > > I am a bit confused by the advice to reduce "the row size" in the
> hbase
> > > > book. It states that every cell value is accomplished by the
> > coordinates
> > > > (row, column and timestamp). I'm just trying to be thorough, so am I
> to
> > > > understand that the rowkey is stored and/ or read for every cell
> value
> > > in a
> > > > record or just once per column family in a record?
> > > >
> > > > I am intrigued by the rows as columns design as described in the book
> > at
> > > > http://hbase.apache.org/book.html#rowkey.design. To make a long
> story
> > > > short, I will end up with a table to store event types and number of
> > > > occurrences in each day. I would prefer to have the event description
> > as
> > > > the row key and the dates when it happened as columns - up to 7300
> for
> > > > roughly 20 years.
> > > > However, the event description is a string of 0.1 to 2Kb and if it is
> > > > stored for each cell value, I will need to use a surrogate (shorter)
> > > value.
> > > >
> > > > Is there a built-in functionality to generate (integer) surrogate
> > values
> > > in
> > > > hbase that can be used on the rowkey or does it need to be hand code
> it
> > > > from scratch?
> > > >
> > >
> >
>


Re: one column family but lots of tables

2013-08-24 Thread yonghu
I think you can take a look at
http://hbase.apache.org/book/regions.arch.html, which describes the data
storage hierarchy of HBase. Given Lars's statement that the extra caution
"stems from the fact that HBase flushes by region (which means all stores of
that region are flushed)", you can think of the limitations on column families
this way: if the size of a region is fixed, the more column families you
create, the less space each column family gets. Since the flush operation is
per region, the flush is triggered as soon as one column family reaches its
size threshold, even if the other column families are nearly empty. Roughly
speaking, the more column families you create, the more I/O cost you will
incur. If flushes were per column family rather than per region, this
limitation could probably be relaxed.


On Sat, Aug 24, 2013 at 5:47 PM, Koert Kuipers  wrote:

> thanks i think it makes sense. i will go through hbase architecture again
> to make sure i fully understand the mapping to "regions" and "stores"
>
>
> On Fri, Aug 23, 2013 at 10:33 AM, Michael Segel
> wrote:
>
> > I think the issue which a lot of people miss is why do you want to use a
> > column family in the first place.
> >
> > Column families are part of the same table structure, and each family is
> > kept separate.
> >
> > So in your design, do you have tables which are related, but are not
> > always used at the same time?
> >
> > The example that I use when I teach about HBase or do a
> > lecture/presentation is an Order Entry system.
> > Here you have an order being entered, then you have  one or many pick
> > slips being generated, same for shipping then there's the invoicing.
> > All separate processes which relate back to the same order.
> >
> > So here it makes sense to use column families.
> >
> > Other areas could be metadata is surrounding a transaction. Again... few
> > column families are tied together.
> >
> > Does this make sense?
> >
> >
> > On Aug 23, 2013, at 12:35 AM, lars hofhansl  wrote:
> >
> > > You can think of it this way: Every region and column family is a
> > "store" in HBase. Each store has a memstore and its own set of HFiles in
> > HDFS.
> > > The more stores you have, the more there is to manage.
> > >
> > > So you want to limit the number of stores. Also note that the word
> > "Table" is somewhat a misnomer in HBase it should have better been called
> > "Keyspace".
> > > The extra caution for the number of column families per table stems
> from
> > the fact that HBase flushes by region (which means all stores of that
> > region are flushed). This in turn means that unless are column families
> > hold roughly the same amount of data you end up with very lopsided
> > distributions of HFile sizes.
> > >
> > > -- Lars
> > >
> > >
> > >
> > > 
> > > From: Koert Kuipers 
> > > To: user@hbase.apache.org; vrodio...@carrieriq.com
> > > Sent: Thursday, August 22, 2013 12:30 PM
> > > Subject: Re: one column family but lots of tables
> > >
> > >
> > > if that is the case, how come people keep warning about limiting the
> > number
> > > of column families to only a handful (with more hbase performance will
> > > degrade supposedly), yet there seems to be no similar warnings for
> number
> > > of tables? see for example here:
> > > http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/27616
> > >
> > > if a table means at least one column family then the number of tables
> > > should also be kept to a minumum, no?
> > >
> > >
> > >
> > >
> > > On Thu, Aug 22, 2013 at 1:58 PM, Vladimir Rodionov
> > > wrote:
> > >
> > >> Nope. Column family is per table (its sub-directory inside your table
> > >> directory in HDFS).
> > >> If you have N tables you will always have , at least, N distinct CFs
> > (even
> > >> if they have the same name).
> > >>
> > >> Best regards,
> > >> Vladimir Rodionov
> > >> Principal Platform Engineer
> > >> Carrier IQ, www.carrieriq.com
> > >> e-mail: vrodio...@carrieriq.com
> > >>
> > >> 
> > >> From: Koert Kuipers [ko...@tresata.com]
> > >> Sent: Thursday, August 22, 2013 8:06 AM
> > >> To: user@hbase.apache.org
> > >> Subject: one column family but lots of tables
> > >>
> > >> i read in multiple places that i should try to limit the number of
> > column
> > >> families in hbase.
> > >>
> > >> do i understand it correctly that when i create lots of tables, but
> they
> > >> all use the same column family (by name), that i am just using one
> > column
> > >> family and i am OK with respect to limiting number of column families
> ?
> > >>
> > >> thanks! koert
> > >>
> > >> Confidentiality Notice:  The information contained in this message,
> > >> including any attachments hereto, may be confidential and is intended
> > to be
> > >> read only by the individual or entity to whom this message is
> > addressed. If
> > >> the reader of this message is not the intended recipient or an agent
> or
> > >> designee of the intend

Re: Does HBase supports parallel table scan if I use MapReduce

2013-08-21 Thread yonghu
Thanks. So scanning the table from a plain Java client, without MapReduce,
will significantly decrease performance.
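
Since TableInputFormat creates one split (and hence one map task) per region,
here is a minimal sketch of a map-only scan job that writes the row keys
straight to HDFS. The table name "t1" and the output path are placeholders,
and the API is the 0.94-era mapreduce package:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelScanJob {
  // one map task per region; each task scans only its own split
  static class ScanMapper extends TableMapper<Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result result, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(new Text(rowKey.get()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "parallel-scan");
    job.setJarByClass(ParallelScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch more rows per RPC
    scan.setCacheBlocks(false);  // do not pollute the block cache during a full scan

    TableMapReduceUtil.initTableMapperJob("t1", scan,
        ScanMapper.class, Text.class, NullWritable.class, job);
    job.setNumReduceTasks(0);    // map-only: results go straight to HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}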

Yong


On Tue, Aug 20, 2013 at 6:02 PM, Jeff Kolesky  wrote:

> The scan will be broken up into multiple map tasks, each of which will run
> over a single split of the table (look at TableInputFormat to see how it is
> done).  The map tasks will run in parallel.
>
> Jeff
>
>
> On Tue, Aug 20, 2013 at 8:45 AM, yonghu  wrote:
>
> > Hello,
> >
> > I know if I use default scan api, HBase scans table in a serial manner,
> as
> > it needs to guarantee the order of the returned tuples. My question is
> if I
> > use MapReduce to read the HBase table, and directly output the results in
> > HDFS, not returned back to client. The HBase scan is still in a serial
> > manner or in this situation it can run a parallel scan.
> >
> > Thanks!
> >
> > Yong
> >
>
>
>
> --
> *Jeff Kolesky*
> Chief Software Architect
> *Opower*
>


Does HBase supports parallel table scan if I use MapReduce

2013-08-20 Thread yonghu
Hello,

I know if I use default scan api, HBase scans table in a serial manner, as
it needs to guarantee the order of the returned tuples. My question is if I
use MapReduce to read the HBase table, and directly output the results in
HDFS, not returned back to client. The HBase scan is still in a serial
manner or in this situation it can run a parallel scan.

Thanks!

Yong


Re: slow operation in postPut

2013-08-01 Thread yonghu
Use HTablePool instead. For more info, see
http://hbase.apache.org/book/client.html.
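
A minimal sketch of what I mean (class and table names are made up; signatures
roughly as of 0.92/0.94): keep one pool in the observer and borrow a table
handle per call, so concurrent postPut invocations never share a single
(non-thread-safe) HTable instance:

import java.io.IOException;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

public class IndexingObserver extends BaseRegionObserver {
  private HTablePool pool;

  @Override
  public void start(CoprocessorEnvironment env) throws IOException {
    pool = new HTablePool(env.getConfiguration(), 10);
  }

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    HTableInterface index = pool.getTable("index_table"); // hypothetical index table
    try {
      index.put(new Put(put.getRow())); // build whatever index row you need here
    } finally {
      index.close(); // returns the handle to the pool
    }
  }

  @Override
  public void stop(CoprocessorEnvironment env) throws IOException {
    pool.close();
  }
}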


On Thu, Aug 1, 2013 at 3:32 PM, yonghu  wrote:

> If I want to use multi-thread with thread safe, which class should I use?
>
>
> On Thu, Aug 1, 2013 at 3:08 PM, Ted Yu  wrote:
>
>> HTable is not thread safe.
>>
>> On Aug 1, 2013, at 5:58 AM, Pavel Hančar  wrote:
>>
>> > Hello,
>> > I have a class extending BaseRegionObserver and I use the postPut
>> method to
>> > run a slow procedure. I'd like to run more these procedures in more
>> > threads. Is it possible to run more HTable.put(put) methods
>> concurrently? I
>> > tried, but I have this error for each thread:
>> >
>> > Exception in thread "Thread-3" java.lang.IndexOutOfBoundsException:
>> Index:
>> > 1, Size: 1
>> >at java.util.ArrayList.rangeCheck(ArrayList.java:604)
>> >at java.util.ArrayList.remove(ArrayList.java:445)
>> >at
>> > org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:966)
>> >at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:811)
>> >at org.apache.hadoop.hbase.client.HTable.put(HTable.java:786)
>> >at img.PutFilesThread.run(PutFilesThread.java:74)
>> >at java.lang.Thread.run(Thread.java:724)
>> >
>> > Anybody has an idea?
>> >  Thanks,
>> >  Pavel Hančar
>>
>
>


Re: slow operation in postPut

2013-08-01 Thread yonghu
If I want to use multi-thread with thread safe, which class should I use?


On Thu, Aug 1, 2013 at 3:08 PM, Ted Yu  wrote:

> HTable is not thread safe.
>
> On Aug 1, 2013, at 5:58 AM, Pavel Hančar  wrote:
>
> > Hello,
> > I have a class extending BaseRegionObserver and I use the postPut method
> to
> > run a slow procedure. I'd like to run more these procedures in more
> > threads. Is it possible to run more HTable.put(put) methods
> concurrently? I
> > tried, but I have this error for each thread:
> >
> > Exception in thread "Thread-3" java.lang.IndexOutOfBoundsException:
> Index:
> > 1, Size: 1
> >at java.util.ArrayList.rangeCheck(ArrayList.java:604)
> >at java.util.ArrayList.remove(ArrayList.java:445)
> >at
> > org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:966)
> >at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:811)
> >at org.apache.hadoop.hbase.client.HTable.put(HTable.java:786)
> >at img.PutFilesThread.run(PutFilesThread.java:74)
> >at java.lang.Thread.run(Thread.java:724)
> >
> > Anybody has an idea?
> >  Thanks,
> >  Pavel Hančar
>


Re: Data missing in import bulk data

2013-07-24 Thread yonghu
In my view, there are two main ways you can lose data:

1. Some tuples share the same row-key + cf + column. When you load your
data into HBase, they end up in the same cell and may exceed the predefined
maximum number of versions, so the older versions silently disappear (a small
sketch follows below).

2. As Ted mentioned, you may import some deletes; do you generate tombstones
in your bulk load?
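
To illustrate case 1, a small sketch that reuses the table and row-key names
from your scan output (the loop and values are made up): five "different"
input records that all collapse onto the same cell, so with the default of 3
versions per column (or fewer, if two puts land in the same millisecond) most
of them silently disappear:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DuplicateKeyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "skgtwo");

    for (int i = 0; i < 5; i++) {
      // same row key + family + qualifier every time
      Put p = new Put(Bytes.toBytes("George McGovern-nm0569566"));
      p.add(Bytes.toBytes("wrapstar"), Bytes.toBytes("data"),
            Bytes.toBytes("record-" + i));
      table.put(p);
    }
    table.close();
  }
}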

By the way, can you show us the schema of the imported data, e.g. whether
it contains duplicates and how your row key is designed?

regards!

Yong


On Wed, Jul 24, 2013 at 3:55 AM, Ted Yu  wrote:

> Which HBase release are you using ?
>
> Was it possible that the import included Delete's ?
>
> Cheers
>
> On Tue, Jul 23, 2013 at 5:23 PM, Huangmao (Homer) Quan  >wrote:
>
> > Hi hbase users,
> >
> > We got an issue when import data from thrift (perl)
> >
> > We found the number of data is less than expected.
> >
> > when scan the table, we got:
> > ERROR: java.lang.RuntimeException:
> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> > attempts=7, exceptions:
> > Tue Jul 23 23:01:41 UTC 2013,
> > org.apache.hadoop.hbase.client.ScannerCallable@180f9720,
> > java.io.IOException: java.io.IOException: Could not iterate
> > StoreFileScanner[HFileScanner for reader
> >
> >
> reader=file:/tmp/hbase-hbase/hbase/skg/d13644aae91d7ee9a8fdde461e8ec217/wrapstar/51a2e5871b7a4af8a2d9d17ed0c14031,
> > compression=none, cacheConf=CacheConfig:enabled [cacheDataOnRead=false]
> > [cacheDataOnWrite=false] [cacheIndexesOnWrite=false]
> > [cacheBloomsOnWrite=false] [cacheEvictOnClose=false]
> > [cacheCompressed=false], firstKey="Laughing"Larry
> > Berger-nm5619461/wrapstar:data/1374615644669/Put, lastKey=Jordan-Patrick
> > Marcantonio-nm0545093/wrapstar:data/137461643/Put, avgKeyLen=47,
> > avgValueLen=652, entries=156586, length=111099401, cur=George
> > McGovern-nm0569566/wrapstar:data/1374616538067/Put/vlen=17162/ts=0]
> >
> >
> > And even weird, when I monitoring the row number during import, I found
> > some time the row number decrease sharply (lots of data missing)
> >
> > hbase(main):003:0> count 'skgtwo'
> > .
> > *134453 row(s)* in 7.5510 seconds
> >
> > hbase(main):004:0> count 'skgtwo'
> > ...
> > *88970 row(s)* in 7.5380 seconds
> >
> > Any suggestion is appreciated.
> >
> > Cheers
> >
> > †Huangmao (Homer) Quan
> > Email:   luj...@gmail.com
> > Google Voice: +1 (530) 903-8125
> > Facebook: http://www.facebook.com/homerquan
> > Linkedin: http://www.linkedin.com/in/homerquan<
> > http://www.linkedin.com/in/earthisflat>
> >
>


Re: How to join 2 tables using hadoop?

2013-07-19 Thread yonghu
You can do this with a single MR job. In the map function, read both tables
and emit, as the output key, the reference key from the one table and the
primary key from the other table. In the reduce function, you can then "join"
the tuples that share the same key. Please note this is a very naive
(reduce-side) approach; for more join optimization options, you can take a
look at the strategies that Pig or Hive use.
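
I don't have a compact sketch of the full two-table reduce-side join handy,
but for Pavan's specific case (fetch an id out of table A, then look it up in
table B) a simpler one-job variant is to scan table A and issue point Gets
against table B from inside the mapper. A rough sketch; the table, family and
qualifier names are all placeholders:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LookupJoin {
  static class JoinMapper extends TableMapper<Text, Text> {
    private HTable lookupTable;

    @Override
    protected void setup(Context ctx) throws IOException {
      lookupTable = new HTable(ctx.getConfiguration(), "tableB");
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      // assume the reference id sits in cfA:id of tableA
      byte[] id = value.getValue(Bytes.toBytes("cfA"), Bytes.toBytes("id"));
      if (id == null) return;
      Result other = lookupTable.get(new Get(id));   // point lookup in tableB
      byte[] joined = other.getValue(Bytes.toBytes("cfB"), Bytes.toBytes("data"));
      ctx.write(new Text(row.get()),
                new Text(joined == null ? "" : Bytes.toString(joined)));
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
      lookupTable.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "lookup-join");
    job.setJarByClass(LookupJoin.class);
    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);
    TableMapReduceUtil.initTableMapperJob("tableA", scan,
        JoinMapper.class, Text.class, Text.class, job);
    job.setNumReduceTasks(0);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}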




On Fri, Jul 19, 2013 at 10:17 AM, Nitin Pawar wrote:

> Try hive with hbase storage handler
>
>
> On Fri, Jul 19, 2013 at 9:54 AM, Pavan Sudheendra  >wrote:
>
> > Hi,
> >
> > I know that HBase by default doesn't support table joins like RDBMS..
> > But anyway, I have a table who value contains a json with a particular
> > ID in it..
> > This id references another table where it is a key..
> >
> > I want to fetch the id first from table A , query table 2 and get its
> > corresponding value..
> >
> > What is the best way of achieving this using the MR framework?
> > Apologizes, i'm still new to Hadoop and HBase so please go easy on me.
> >
> > Thanks for any help
> >
> > --
> > Regards-
> > Pavan
> >
>
>
>
> --
> Nitin Pawar
>


Re: several doubts about region split?

2013-07-17 Thread yonghu
Thanks for your quick response!

For question one, what will the latency be? How long do we need to wait
until the daughter regions are online again?

regards!

Yong



On Wed, Jul 17, 2013 at 4:05 PM, Ted Yu  wrote:

> bq. Does it mean the region which will be splitted is not available
> anymore?
>
> Right.
>
> bq. What happened to the read and write requests to that region?
>
> The requests wouldn't be served by the hosting region server until daughter
> regions become online.
>
> Will try to dig up answer to question #2.
> In short, load balancer is supposed to offload one of the daughter regions
> if continuous write load incurs.
>
> Cheers
>
> On Wed, Jul 17, 2013 at 6:53 AM, yonghu  wrote:
>
> > Dear all,
> >
> > From the HBase reference book, it mentions that when RegionServer splits
> > regions, it will offline the split region and then adds the daughter
> > regions to META, opens daughters on the parent's hosting RegionServer and
> > then reports the split to the Master.
> >
> > I have a several questions:
> >
> > 1. What does offline means? Does it mean the region which will be
> splitted
> > is not available anymore? What happened to the read and write requests to
> > that region?
> >
> > 2. From the description, if I understand right it means that now the
> > RegionServer will contain two Regions (One RegionServer for both daughter
> > and parent regions ) instead of one RegionSever for daughter and one for
> > parent. If it is, what are the benefits of this approach? Hot-spot
> problem
> > is still there. Moreover, this approach will be a big problem if we use
> the
> > HBase default split approach. Suppose we bulk load data into HBase
> cluster,
> > initially every write request will be accepted by only one RegionServer.
> > After some write requests, the RegionServer cannot response any write
> > request as it reaches its disk volume threshold. Hence, some data must be
> > removed from one RegionSever to the other RegionServer. The question is
> > that why we don't do it at the region split time?
> >
> > Thanks!
> >
> > Yong
> >
>


several doubts about region split?

2013-07-17 Thread yonghu
Dear all,

The HBase reference book mentions that when a RegionServer splits a region,
it offlines the region being split, then adds the daughter regions to META,
opens the daughters on the parent's hosting RegionServer, and then reports
the split to the Master.

I have a several questions:

1. What does offline mean? Does it mean the region being split is not
available anymore? What happens to the read and write requests to that
region?

2. From the description, if I understand correctly, the RegionServer will now
contain two regions (one RegionServer hosting both the daughter and parent
regions) instead of one RegionServer for the daughter and one for the parent.
If so, what are the benefits of this approach? The hot-spot problem is still
there. Moreover, this approach becomes a big problem if we use HBase's default
split approach. Suppose we bulk load data into an HBase cluster: initially,
every write request is accepted by only one RegionServer. After some number of
writes, that RegionServer can no longer serve write requests because it reaches
its disk volume threshold. Hence, some data must be moved from one RegionServer
to another. The question is: why don't we do that at region split time?

Thanks!

Yong


a question about assigning timestamp?

2013-07-13 Thread yonghu
Hello,

From the reference book (section "5.8.1.4. Put"), if I issue a Put command
without specifying a timestamp, the server will generate a TS for me. I wonder
whether the "server" means the master node or the RegionServers? In my
understanding, the server means the RegionServer, as the master only tells the
client which address to interact with to finish the Put operation. Am I right?

regards!

yong


Re: Deleting range of rows from Hbase.

2013-07-04 Thread yonghu
So I can think of it as a predefined coprocessor. :)

regards!

Yong


On Thu, Jul 4, 2013 at 1:12 PM, Anoop John  wrote:

> BulkDeleteEndpoint  is a coprocessor endpoint impl.  For the usage pls
> refer TestBulkDeleteProtocol.
> You will be able to call the API at client side and the actual execution
> will happen at server side.. (This is what will happen with Endpoints :)  )
>
> -Anoop-
>
> On Thu, Jul 4, 2013 at 4:29 PM, yonghu  wrote:
>
> > Hi Anoop
> > one more question. Can I use BulkDeleteEndpoint at the client side or
> > should I use it like coprocessor which deployed in the server side?
> >
> > Thanks!
> >
> > Yong
> >
> >
> > On Thu, Jul 4, 2013 at 12:50 PM, Anoop John 
> wrote:
> >
> > > It is not supported from shell.  Not directly from delete API also..
> > >
> > > You can have a look at BulkDeleteEndpoint which can do what you want to
> > >
> > > -Anoop-
> > >
> > > On Thu, Jul 4, 2013 at 4:09 PM, yonghu  wrote:
> > >
> > > > I check the latest api of Delete class. I am afraid you have to do it
> > by
> > > > yourself.
> > > >
> > > > regards!
> > > >
> > > > Yong
> > > >
> > > >
> > > > On Wed, Jul 3, 2013 at 6:46 PM, Rahul Bhattacharjee <
> > > > rahul.rec@gmail.com
> > > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Like scan with range. I would like to delete rows with range.Is
> this
> > > > > supported from hbase shell ?
> > > > >
> > > > > Lets say I have a table with keys like
> > > > >
> > > > > A-34335
> > > > > A-34353535
> > > > > A-335353232
> > > > > B-33435
> > > > > B-4343
> > > > > C-5353533
> > > > >
> > > > > I want to delete all rows with prefix A using hbase shell.
> > > > >
> > > > > Thanks,
> > > > > Rahul
> > > > >
> > > >
> > >
> >
>


Re: Deleting range of rows from Hbase.

2013-07-04 Thread yonghu
Hi Anoop
one more question: can I call the BulkDeleteEndpoint from the client side, or
do I have to use it like a coprocessor deployed on the server side?

Thanks!

Yong


On Thu, Jul 4, 2013 at 12:50 PM, Anoop John  wrote:

> It is not supported from shell.  Not directly from delete API also..
>
> You can have a look at BulkDeleteEndpoint which can do what you want to
>
> -Anoop-
>
> On Thu, Jul 4, 2013 at 4:09 PM, yonghu  wrote:
>
> > I check the latest api of Delete class. I am afraid you have to do it by
> > yourself.
> >
> > regards!
> >
> > Yong
> >
> >
> > On Wed, Jul 3, 2013 at 6:46 PM, Rahul Bhattacharjee <
> > rahul.rec@gmail.com
> > > wrote:
> >
> > > Hi,
> > >
> > > Like scan with range. I would like to delete rows with range.Is this
> > > supported from hbase shell ?
> > >
> > > Lets say I have a table with keys like
> > >
> > > A-34335
> > > A-34353535
> > > A-335353232
> > > B-33435
> > > B-4343
> > > C-5353533
> > >
> > > I want to delete all rows with prefix A using hbase shell.
> > >
> > > Thanks,
> > > Rahul
> > >
> >
>


Re: Deleting range of rows from Hbase.

2013-07-04 Thread yonghu
I checked the latest API of the Delete class. I am afraid you have to do it
yourself.
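
If you want to do it from the client side (as opposed to the
BulkDeleteEndpoint that Anoop mentioned), a rough sketch using Rahul's "A-"
prefix; the table name is a placeholder:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixDelete {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    // scan only the rows whose keys start with "A-" ...
    Scan scan = new Scan(Bytes.toBytes("A-"));
    scan.setFilter(new PrefixFilter(Bytes.toBytes("A-")));
    ResultScanner scanner = table.getScanner(scan);

    // ... and turn every matching row into a Delete
    List<Delete> deletes = new ArrayList<Delete>();
    for (Result r : scanner) {
      deletes.add(new Delete(r.getRow()));
    }
    scanner.close();

    table.delete(deletes); // batched, but still a client-side round trip
    table.close();
  }
}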

regards!

Yong


On Wed, Jul 3, 2013 at 6:46 PM, Rahul Bhattacharjee  wrote:

> Hi,
>
> Like scan with range. I would like to delete rows with range.Is this
> supported from hbase shell ?
>
> Lets say I have a table with keys like
>
> A-34335
> A-34353535
> A-335353232
> B-33435
> B-4343
> C-5353533
>
> I want to delete all rows with prefix A using hbase shell.
>
> Thanks,
> Rahul
>


Re: HBase failure scenarios

2013-06-10 Thread yonghu
Hi Lucas,

First, a write request in HBase consists of two parts:
1. Write to the WAL;
2. Write to the Memstore; when the Memstore reaches its threshold, the data in
the Memstore is flushed to disk.

In my understanding, there are two data synchronization points:

The first one is the write to the WAL. As the WAL is persisted on disk via
HDFS, it is propagated to 2 other nodes (assuming a replication factor of 3).
The second one is when the Memstore reaches its threshold and its data is
flushed to disk; this flush also goes through the HDFS write pipeline.
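
Related to durability, the 0.94-era client also lets you skip the WAL per Put,
trading safety for speed; a small sketch (table and column names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalToggleExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t");

    Put durable = new Put(Bytes.toBytes("row1"));
    durable.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    // default: the edit is appended to the WAL (which lives in HDFS and is
    // therefore replicated) before the call returns
    table.put(durable);

    Put fast = new Put(Bytes.toBytes("row2"));
    fast.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    fast.setWriteToWAL(false); // memstore only: lost if the RS dies before a flush
    table.put(fast);

    table.close();
  }
}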

regards

Yong



On Tue, Jun 11, 2013 at 2:39 AM, Lucas Stanley  wrote:

> Thanks Azuryy!
>
> So, when a write is successful to the WAL on the responsible region server,
> in fact that means that the write was committed to 3 total DataNodes,
> correct?
>
>
> On Mon, Jun 10, 2013 at 5:37 PM, Azuryy Yu  wrote:
>
> > yes. datanode write is pipeline. and only if pipeline writing finished,
> dn
> > return ok.
> >
> > --Send from my Sony mobile.
> > On Jun 11, 2013 8:27 AM, "Lucas Stanley"  wrote:
> >
> > > Hi,
> > >
> > > In the Strata 2013 training lectures, Jonathan Hsieh from Cloudera said
> > > something about HBase syncs which I'm trying to understand further.
> > >
> > > He said that HBase sync guarantees only that a write goes to the local
> > disk
> > > on the region server responsible for that region and in-memory copies
> go
> > on
> > > 2 other machines in the HBase cluster.
> > >
> > > But I thought that when the write goes to the WAL on the first region
> > > server, that the HDFS append would push that write to 3 machines total
> in
> > > the HDFS cluster. In order for the append write to the WAL to be
> > > successful, doesn't the DataNode on that machine have to pipeline the
> > write
> > > to 2 other DataNodes?
> > >
> > > I'm not sure what Jonathan was referring to when he said that 2
> in-memory
> > > copies go to other HBase machines? Even when the memstore on the first
> > > region server gets full, doesn't the flush to the HFile get written on
> 3
> > > HDFS nodes in total?
> > >
> >
>


Re: Poor HBase map-reduce scan performance

2013-06-05 Thread yonghu
Dear Sandy,

Thanks for your explanation.

However, what I don't get is your term "client"; does "client" here mean the
MapReduce job? If I understand you correctly, this means the map function
processes the tuples, and during that processing time the RegionServer does
nothing?

regards!

Yong


On Wed, Jun 5, 2013 at 6:12 PM, Ted Yu  wrote:

> bq. the Regionserver and Tasktracker are the same node when you use
> MapReduce to scan the HBase table.
>
> The scan performed by the Tasktracker on that node would very likely access
> data hosted by region server on other node(s). So there would be RPC
> involved.
>
> There is some discussion on providing shadow reads - writes to specific
> region are solely served by one region server but the reads can be served
> by more than one region server. Of course consistency is one aspect that
> must be tackled.
>
> Cheers
>
> On Wed, Jun 5, 2013 at 7:55 AM, yonghu  wrote:
>
> > Can anyone explain why client + rpc + server will decrease the
> performance
> > of scanning? I mean the Regionserver and Tasktracker are the same node
> when
> > you use MapReduce to scan the HBase table. So, in my understanding, there
> > will be no rpc cost.
> >
> > Thanks!
> >
> > Yong
> >
> >
> > On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt  wrote:
> >
> > > https://issues.apache.org/jira/browse/HBASE-8691
> > >
> > >
> > > On 6/4/13 6:11 PM, "Sandy Pratt"  wrote:
> > >
> > > >Haven't had a chance to write a JIRA yet, but I thought I'd pop in
> here
> > > >with an update in the meantime.
> > > >
> > > >I tried a number of different approaches to eliminate latency and
> > > >"bubbles" in the scan pipeline, and eventually arrived at adding a
> > > >streaming scan API to the region server, along with refactoring the
> scan
> > > >interface into an event-drive message receiver interface.  In so
> doing,
> > I
> > > >was able to take scan speed on my cluster from 59,537 records/sec with
> > the
> > > >classic scanner to 222,703 records per second with my new scan API.
> > > >Needless to say, I'm pleased ;)
> > > >
> > > >More details forthcoming when I get a chance.
> > > >
> > > >Thanks,
> > > >Sandy
> > > >
> > > >On 5/23/13 3:47 PM, "Ted Yu"  wrote:
> > > >
> > > >>Thanks for the update, Sandy.
> > > >>
> > > >>If you can open a JIRA and attach your producer / consumer scanner
> > there,
> > > >>that would be great.
> > > >>
> > > >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt 
> > wrote:
> > > >>
> > > >>> I wrote myself a Scanner wrapper that uses a producer/consumer
> queue
> > to
> > > >>> keep the client fed with a full buffer as much as possible.  When
> > > >>>scanning
> > > >>> my table with scanner caching at 100 records, I see about a 24%
> > uplift
> > > >>>in
> > > >>> performance (~35k records/sec with the ClientScanner and ~44k
> > > >>>records/sec
> > > >>> with my P/C scanner).  However, when I set scanner caching to 5000,
> > > >>>it's
> > > >>> more of a wash compared to the standard ClientScanner: ~53k
> > records/sec
> > > >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> > > >>>
> > > >>> I'm not sure what to make of those results.  I think next I'll shut
> > > >>>down
> > > >>> HBase and read the HFiles directly, to see if there's a drop off in
> > > >>> performance between reading them directly vs. via the RegionServer.
> > > >>>
> > > >>> I still think that to really solve this there needs to be sliding
> > > >>>window
> > > >>> of records in flight between disk and RS, and between RS and
> client.
> > > >>>I'm
> > > >>> thinking there's probably a single batch of records in flight
> between
> > > >>>RS
> > > >>> and client at the moment.
> > > >>>
> > > >>> Sandy
> > > >>>
> > > >>> On 5/23/13 8:45 AM, "Bryan Keller"  wrote:
> > > >>>
> > > >>> >I am considering scanning a snapshot instead of the table. I
> believe
> > > >>>this
> > > >>> >is what the ExportSnapshot class does. If I could use the scanning
> > > >>>code
> > > >>> >from ExportSnapshot then I will be able to scan the HDFS files
> > > >>>directly
> > > >>> >and bypass the regionservers. This could potentially give me a
> huge
> > > >>>boost
> > > >>> >in performance for full table scans. However, it doesn't really
> > > >>>address
> > > >>> >the poor scan performance against a table.
> > > >>>
> > > >>>
> > > >
> > >
> > >
> >
>


Re: Poor HBase map-reduce scan performance

2013-06-05 Thread yonghu
Can anyone explain why the client + RPC + server path decreases scan
performance? I mean, the RegionServer and the TaskTracker are on the same node
when you use MapReduce to scan an HBase table, so in my understanding there
should be no RPC cost.

Thanks!

Yong


On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt  wrote:

> https://issues.apache.org/jira/browse/HBASE-8691
>
>
> On 6/4/13 6:11 PM, "Sandy Pratt"  wrote:
>
> >Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
> >with an update in the meantime.
> >
> >I tried a number of different approaches to eliminate latency and
> >"bubbles" in the scan pipeline, and eventually arrived at adding a
> >streaming scan API to the region server, along with refactoring the scan
> >interface into an event-drive message receiver interface.  In so doing, I
> >was able to take scan speed on my cluster from 59,537 records/sec with the
> >classic scanner to 222,703 records per second with my new scan API.
> >Needless to say, I'm pleased ;)
> >
> >More details forthcoming when I get a chance.
> >
> >Thanks,
> >Sandy
> >
> >On 5/23/13 3:47 PM, "Ted Yu"  wrote:
> >
> >>Thanks for the update, Sandy.
> >>
> >>If you can open a JIRA and attach your producer / consumer scanner there,
> >>that would be great.
> >>
> >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt  wrote:
> >>
> >>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
> >>> keep the client fed with a full buffer as much as possible.  When
> >>>scanning
> >>> my table with scanner caching at 100 records, I see about a 24% uplift
> >>>in
> >>> performance (~35k records/sec with the ClientScanner and ~44k
> >>>records/sec
> >>> with my P/C scanner).  However, when I set scanner caching to 5000,
> >>>it's
> >>> more of a wash compared to the standard ClientScanner: ~53k records/sec
> >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> >>>
> >>> I'm not sure what to make of those results.  I think next I'll shut
> >>>down
> >>> HBase and read the HFiles directly, to see if there's a drop off in
> >>> performance between reading them directly vs. via the RegionServer.
> >>>
> >>> I still think that to really solve this there needs to be sliding
> >>>window
> >>> of records in flight between disk and RS, and between RS and client.
> >>>I'm
> >>> thinking there's probably a single batch of records in flight between
> >>>RS
> >>> and client at the moment.
> >>>
> >>> Sandy
> >>>
> >>> On 5/23/13 8:45 AM, "Bryan Keller"  wrote:
> >>>
> >>> >I am considering scanning a snapshot instead of the table. I believe
> >>>this
> >>> >is what the ExportSnapshot class does. If I could use the scanning
> >>>code
> >>> >from ExportSnapshot then I will be able to scan the HDFS files
> >>>directly
> >>> >and bypass the regionservers. This could potentially give me a huge
> >>>boost
> >>> >in performance for full table scans. However, it doesn't really
> >>>address
> >>> >the poor scan performance against a table.
> >>>
> >>>
> >
>
>


Re: Change data capture tool for hbase

2013-06-03 Thread yonghu
Hello,

I have presented 5 CDC approaches based on HBase and published my results
at ADBIS 2013.

regards!

Yong


On Mon, Jun 3, 2013 at 11:16 AM, yavuz gokirmak  wrote:

> Hi all,
>
> Currently we are working on a hbase change data capture (CDC) tool. I want
> to share our ideas and continue development according to your feedback.
>
> As you know CDC tools are used for tracking the data changes and take
> actions according to these changes[1].  For example in relational
> databases, CDC tools are mainly used for replication. You can replicate
> your source system continuously to another location or db using CDC tool.So
> whenever an insert/update/delete is done on the source system, you can
> reflect the same operation to the replicated environment.
>
> As I've said, we are working on a CDC tool that can track changes on a
> hbase table and reflect those changes to any other system in real-time.
>
> What we are trying to implement the tool in a way that he will behave as a
> slave cluster. So if we enable master-master replication in the source
> system, we expect to get all changes and act accordingly. Once the proof of
> concept cdc tool is implemented ( we need one week ) we will convert it to
> a flume source. So using it as a flume source we can direct data changes to
> any destination (sink)
>
> This is just a summary.
> Please write your feedback and comments.
>
> Do you know any tool similar to this proposal?
>
> regards.
>
>
>
>
>
> 1- http://en.wikipedia.org/wiki/Change_data_capture
>


Re: Doubt Regading HLogs

2013-05-17 Thread yonghu
In this situation, you can set the

> <property>
>   <name>hbase.regionserver.logroll.period</name>
>   <value>360</value>
> </property>

to a short value, say 3000, and then you can see the log file with its current
size after 3 seconds.

To Nicolas,

I guess he wants to analyze the HLog somehow.

regards!

Yong



On Fri, May 17, 2013 at 1:27 PM, Rishabh Agrawal <
rishabh.agra...@impetus.co.in> wrote:

> Thanks Nicolas,
>
> When will  this file be finalized. Is it time bound? Or it will be always
> be zero for last one (even if it contains the data)
>
> -Original Message-
> From: Nicolas Liochon [mailto:nkey...@gmail.com]
> Sent: Friday, May 17, 2013 4:39 PM
> To: user
> Subject: Re: Doubt Regading HLogs
>
> That's HDFS.
>
> When a file is currently written, the size is not known, as the write is
> in progress. So the namenode reports a size of zero (more exactly, it does
> not take into account the hdfs block beeing written when it calculates the
> size). When you read, you go to the datanode owning the data, so you see
> the real content as it is at the time of reading.
>
> btw, why do you want to read the HLog?
>
>
> On Fri, May 17, 2013 at 12:53 PM, Rishabh Agrawal <
> rishabh.agra...@impetus.co.in> wrote:
>
> >  Hello,
> >
> > I am working with Hlogs of Hbase and I have this doubt that HDFS shows
> > size of last log file as zero. But when I open it I see data in it.
> > When I add extra data a new file with zero size is created and
> > previous HLog file gets its size.  This thing applies to each region
> > server.  Following is the purged screen shot of same:
> >
> >
> >
> > I have set following parameters in hbase-site.xml for logs:
> >
> > <property>
> >   <name>hbase.regionserver.logroll.period</name>
> >   <value>360</value>
> > </property>
> >
> > <property>
> >   <name>hbase.master.logcleaner.ttl</name>
> >   <value>60480</value>
> > </property>
> >
> > <property>
> >   <name>hbase.regionserver.optionallogflushinterval</name>
> >   <value>3000</value>
> > </property>
> >
> >
> >
> >
> >
> > I plan to read log files for some validation work, Please guide me
> > through this behavior of Hbase.
> >
> >
> >
> >
> >
> > Thanks and Regards
> >
> > Rishabh Agrawal
> >
> > Software Engineer
> >
> > Impetus Infotech (India) Pvt. Ltd.
> >
> > (O) +91.731.426.9300 x4526
> >
> > (M) +91.812.026.2722
> >
> > www.impetus.com
> >
> >
> >
> > --
> >
> >
> >
> >
> >
> >
> > NOTE: This message may contain information that is confidential,
> > proprietary, privileged or otherwise protected by law. The message is
> > intended solely for the named addressee. If received in error, please
> > destroy and notify the sender. Any use of this email is prohibited
> > when received in error. Impetus does not represent, warrant and/or
> > guarantee, that the integrity of this communication has been
> > maintained nor that the communication is free of errors, virus,
> interception or interference.
> >
>
> 
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee,
> that the integrity of this communication has been maintained nor that the
> communication is free of errors, virus, interception or interference.
>


How can I set heap size for HBase?

2013-03-11 Thread yonghu
Dear All,

I wonder how I can set the heap size for HBase, and what portion of the
total memory is a suitable amount to give it.
The other question is how much memory I need to give to Java when I run
HBase, as I sometimes get "out of memory" problems.

Thanks!

Yong


Re: Possible to delete a specific cell?

2013-03-07 Thread yonghu
Hello,

I think you can use HBase's org.apache.hadoop.hbase.client.Delete class.
It already supports deleting a specific version of a cell; see the
public Delete deleteColumn(byte[] family, byte[] qualifier, long
timestamp) method.
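
A minimal sketch against Jonathan's example table (not necessarily what his
gist does), deleting only the version at timestamp 3:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteOneVersion {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "del_test");

    // remove only the cell f1:c1 @ timestamp 3; the other versions stay visible
    Delete d = new Delete(Bytes.toBytes("key"));
    d.deleteColumn(Bytes.toBytes("f1"), Bytes.toBytes("c1"), 3L);
    table.delete(d);

    table.close();
  }
}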

regards!

Yong

On Thu, Mar 7, 2013 at 9:25 PM, Jonathan Natkins  wrote:
> For those who care, it turns out that this use case is currently possible,
> just not through the hbase shell:
>
> hbase(main):002:0> scan 'del_test', { VERSIONS => 10 }
> ROW   COLUMN+CELL
>
>
>  key  column=f1:c1, timestamp=5,
> value=value5
>
>  key  column=f1:c1, timestamp=4,
> value=value4
>
>  key  column=f1:c1, timestamp=2,
> value=value2
>
>  key  column=f1:c1, timestamp=1,
> value=value1
>
> 1 row(s) in 0.0650 seconds
>
> Here's a small program to try it yourself:
> https://gist.github.com/jnatkins/5111513
>
>
>
> On Thu, Mar 7, 2013 at 10:49 AM, Kevin O'dell wrote:
>
>> Ted,
>>
>>  Yes that is correct, sorry 3 is newer than 1 when speak TSs.  Sorry for
>> the confusion :)
>>
>> On Thu, Mar 7, 2013 at 1:48 PM, Ted Yu  wrote:
>>
>> > I think there was typo in Kevin's email: t3 should be t1
>> >
>> > On Thu, Mar 7, 2013 at 10:42 AM, Kevin O'dell > > >wrote:
>> >
>> > > JM,
>> > >
>> > >   If you delete t2, you will also wipe out t3 right now.
>> > >
>> > > On Thu, Mar 7, 2013 at 1:37 PM, Jean-Marc Spaggiari <
>> > > jean-m...@spaggiari.org
>> > > > wrote:
>> > >
>> > > > Kevin,
>> > > >
>> > > > How do you see that? Like a specific cell format which can "cancel"
>> > > > once timestamp and no delete all the previous one?
>> > > >
>> > > > Like before compaction we can have
>> > > >
>> > > > v1:t1
>> > > > v1:t2
>> > > > v1:t3
>> > > > v1:d2 <= Delete only t2 version.
>> > > >
>> > > > And at compaction time we only keep that in mind and give this as a
>> > > result:
>> > > > v1:t1
>> > > > v1:t3
>> > > >
>> > > > ?
>> > > >
>> > > > 2013/3/7 Jeff Kolesky :
>> > > > > Yes, this behavior would be fantastic.  If you follow the Kiji/Wibi
>> > > model
>> > > > > of using many versioned cells, being able to delete a specific cell
>> > > > without
>> > > > > deleting all cells prior to it would be very useful.
>> > > > >
>> > > > > Jeff
>> > > > >
>> > > > >
>> > > > > On Thu, Mar 7, 2013 at 10:26 AM, Kevin O'dell <
>> > > kevin.od...@cloudera.com
>> > > > >wrote:
>> > > > >
>> > > > >> The problem is it kills all older cells.  We should probably file
>> a
>> > > JIRA
>> > > > >> for this, as this behavior would be nice.  Thoughts?:
>> > > > >>
>> > > > >> hbase(main):028:0> truncate 'tre'
>> > > > >>
>> > > > >> Truncating 'tre' table (it may take a while):
>> > > > >>
>> > > > >> - Disabling table...
>> > > > >>
>> > > > >> - Dropping table...
>> > > > >>
>> > > > >> - Creating table...
>> > > > >>
>> > > > >> 0 row(s) in 4.6060 seconds
>> > > > >>
>> > > > >>
>> > > > >> hbase(main):029:0> put 'tre', 'row1', 'cf1:c1', 'abc', 111
>> > > > >>
>> > > > >> 0 row(s) in 0.0220 seconds
>> > > > >>
>> > > > >>
>> > > > >> hbase(main):030:0> put 'tre', 'row1', 'cf1:c1', 'abcd', 112
>> > > > >>
>> > > > >> 0 row(s) in 0.0060 seconds
>> > > > >>
>> > > > >>
>> > > > >> hbase(main):031:0> put 'tre', 'row1', 'cf1:c1', 'abce', 113
>> > > > >>
>> > > > >> 0 row(s) in 0.0120 seconds
>> > > > >>
>> > > > >>
>> > > > >> hbase(main):032:0> scan 'tre', {NAME => 'cf1:c1', VERSIONS => 4}
>> > > > >>
>> > > > >> ROW
>> > >  COLUMN+CELL
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> row1
>> > > >  column=cf1:c1,
>> > > > >> timestamp=113, value=abce
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> row1
>> > > >  column=cf1:c1,
>> > > > >> timestamp=112, value=abcd
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> row1
>> > > >  column=cf1:c1,
>> > > > >> timestamp=111, value=abc
>> > > > >>
>> > > > >>
>> > > > >> hbase(main):033:0> delete 'tre', 'row1', 'cf1:c1', 112
>> > > > >>
>> > > > >> 0 row(s) in 0.0110 seconds
>> > > > >>
>> > > > >>
>> > > > >> hbase(main):034:0> scan 'tre', {NAME => 'cf1:c1', VERSIONS => 4}
>> > > > >>
>> > > > >> ROW
>> > >  COLUMN+CELL
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >>  row1
>> > > >  column=cf1:c1,
>> > > > >> timestamp=113, value=abce
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> 1 row(s) in 0.0290 seconds
>> > > > >>
>> > > > >>
>> > > > >> On Thu, Mar 7, 2013 at 1:22 PM, Sergey Shelukhin <
>> > > > ser...@hortonworks.com
>> > > > >> >wrote:
>> > > > >>
>> > > > >> > Shouldn't you be able to insert delete with t belonging to (t2,
>> > t3)
>> > > to
>> > > > >> > achieve this effect?
>> > > > >> >
>> > > > >> > On Thu, Mar 7, 2013 at 10:12 AM, Jonathan Natkins <
>> > > na...@wibidata.com
>> > > > >> > >wrote:
>> > > > >> >
>> > > > >> > > Yep, that's the scenario I was curious about. Thanks!
>> > > > >> > >
>> > > > >> > >
>> > > > >> > > On Thu, Mar 7, 2013 at 10:

Re: Hbase table with a nested entity

2013-02-27 Thread yonghu
Hello Dastgiri,

I don't think HBase can directly support the nested schema you want
to define. But you can still store your data in HBase. I see
several possible solutions:

1. row_key: profileid + profilename + date; the columns will be
monthwiseProfileCount:uk and so on. However, this approach causes
data redundancy (profileid + profilename appear repeatedly), and
the data belonging to the same user is spread over different
rows.

2. row_key: profileid + profilename; the columns will be
monthwiseProfileCount:date/uk (e.g. 12/10/2010/uk) and so on (see the
sketch below). The benefit of this approach is that all the data belonging
to the same user is grouped together. However, since the date is part of
the column qualifier, it will create many columns if the range of dates
is wide.
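
A rough sketch of option 2 (the table name, key separator and value encoding
are all just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ProfileCountExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // hypothetical table "profiles" with column family "monthwiseProfileCount"
    HTable table = new HTable(conf, "profiles");

    // option 2: one row per profile, one column per date + country
    Put p = new Put(Bytes.toBytes("1|profileA"));   // profileid + profilename
    byte[] cf = Bytes.toBytes("monthwiseProfileCount");
    p.add(cf, Bytes.toBytes("12/10/2010/uk"), Bytes.toBytes("200"));
    p.add(cf, Bytes.toBytes("12/10/2010/us"), Bytes.toBytes("300"));
    p.add(cf, Bytes.toBytes("12/10/2010/india"), Bytes.toBytes("500"));
    p.add(cf, Bytes.toBytes("12/11/2010/uk"), Bytes.toBytes("200"));
    table.put(p);

    table.close();
  }
}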

regards!

Yong

On Wed, Feb 27, 2013 at 5:38 AM, Dastagiri S Shaik
 wrote:
> Hi All,
>
> I need to define a schema
>
> profileid  (integer)
> profilename (String)
> monthwiseProfileCount (is having )
> 12/10/2010-->
> uk:200
> us:300
> india:500
>
> 12/11/2010-->
> uk:200
> us:300
> india:500
>
>
> please help me.
>
> Regards
> Dastgiri
>
>


Re: coprocessor enabled put very slow, help please~~~

2013-02-18 Thread yonghu
Ok. Now, I got your point. I didn't notice the "checkAndPut".

regards!

Yong

On Mon, Feb 18, 2013 at 1:11 PM, Michael Segel
 wrote:
>
> The  issue I was talking about was the use of a check and put.
> The OP wrote:
>>>>> each map inserts to doc table.(checkAndPut)
>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>> a index table.
>
> My question is why does the OP use a checkAndPut, and the RegionObserver's 
> postChecAndPut?
>
>
> Here's a good example... 
> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>
> The OP doesn't really get in to the use case, so we don't know why the Check 
> and Put in the M/R job.
> He should just be using put() and then a postPut().
>
> Another issue... since he's writing to  a different HTable... how? Does he 
> create an HTable instance in the start() method of his RO object and then 
> reference it later? Or does he create the instance of the HTable on the fly 
> in each postCheckAndPut() ?
> Without seeing his code, we don't know.
>
> Note that this is synchronous set of writes. Your overall return from the M/R 
> call to put will wait until the second row is inserted.
>
> Interestingly enough, you may want to consider disabling the WAL on the write 
> to the index.  You can always run a M/R job that rebuilds the index should 
> something occur to the system where you might lose the data.  Indexes *ARE* 
> expendable. ;-)
>
> Does that explain it?
>
> -Mike
>
> On Feb 18, 2013, at 4:57 AM, yonghu  wrote:
>
>> Hi, Michael
>>
>> I don't quite understand what do you mean by "round trip back to the
>> client". In my understanding, as the RegionServer and TaskTracker can
>> be the same node, MR don't have to pull data into client and then
>> process.  And you also mention the "unnecessary overhead", can you
>> explain a little bit what operations or data processing can be seen as
>> "unnecessary overhead".
>>
>> Thanks
>>
>> yong
>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>  wrote:
>>> Why?
>>>
>>> This seems like an unnecessary overhead.
>>>
>>> You are writing code within the coprocessor on the server.  Pessimistic 
>>> code really isn't recommended if you are worried about performance.
>>>
>>> I have to ask... by the time you have executed the code in your 
>>> co-processor, what would cause the initial write to fail?
>>>
>>>
>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel  wrote:
>>>
>>>> its a local read. i just check the last param of PostCheckAndPut 
>>>> indicating if the Put succeeded. Incase if the put success, i insert a row 
>>>> in another table
>>>>
>>>> Sincerely,
>>>> Prakash Kadel
>>>>
>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan  wrote:
>>>>
>>>>> Is your CheckAndPut involving a local or remote READ? Due to the nature of
>>>>> LSM, read is much slower compared to a write...
>>>>>
>>>>>
>>>>> Best Regards,
>>>>> Wei
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> From:   Prakash Kadel 
>>>>> To: "user@hbase.apache.org" ,
>>>>> Date:   02/17/2013 07:49 PM
>>>>> Subject:coprocessor enabled put very slow, help please~~~
>>>>>
>>>>>
>>>>>
>>>>> hi,
>>>>> i am trying to insert few million documents to hbase with mapreduce. To
>>>>> enable quick search of docs i want to have some indexes, so i tried to use
>>>>> the coprocessors, but they are slowing down my inserts. Arent the
>>>>> coprocessors not supposed to increase the latency?
>>>>> my settings:
>>>>>  3 region servers
>>>>> 60 maps
>>>>> each map inserts to doc table.(checkAndPut)
>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>> a index table.
>>>>>
>>>>>
>>>>> Sincerely,
>>>>> Prakash
>>>>>
>>>>
>>>
>>> Michael Segel  | (m) 312.755.9623
>>>
>>> Segel and Associates
>>>
>>>
>>
>


Re: coprocessor enabled put very slow, help please~~~

2013-02-18 Thread yonghu
Hi, Michael

I don't quite understand what you mean by "round trip back to the
client". In my understanding, since the RegionServer and TaskTracker can
be on the same node, MR does not have to pull the data to a client and then
process it. You also mention "unnecessary overhead"; can you
explain a little what operations or data processing count as
"unnecessary overhead"?

Thanks

yong
On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
 wrote:
> Why?
>
> This seems like an unnecessary overhead.
>
> You are writing code within the coprocessor on the server.  Pessimistic code 
> really isn't recommended if you are worried about performance.
>
> I have to ask... by the time you have executed the code in your co-processor, 
> what would cause the initial write to fail?
>
>
> On Feb 18, 2013, at 3:01 AM, Prakash Kadel  wrote:
>
>> its a local read. i just check the last param of PostCheckAndPut indicating 
>> if the Put succeeded. Incase if the put success, i insert a row in another 
>> table
>>
>> Sincerely,
>> Prakash Kadel
>>
>> On Feb 18, 2013, at 2:52 PM, Wei Tan  wrote:
>>
>>> Is your CheckAndPut involving a local or remote READ? Due to the nature of
>>> LSM, read is much slower compared to a write...
>>>
>>>
>>> Best Regards,
>>> Wei
>>>
>>>
>>>
>>>
>>> From:   Prakash Kadel 
>>> To: "user@hbase.apache.org" ,
>>> Date:   02/17/2013 07:49 PM
>>> Subject:coprocessor enabled put very slow, help please~~~
>>>
>>>
>>>
>>> hi,
>>>  i am trying to insert few million documents to hbase with mapreduce. To
>>> enable quick search of docs i want to have some indexes, so i tried to use
>>> the coprocessors, but they are slowing down my inserts. Arent the
>>> coprocessors not supposed to increase the latency?
>>> my settings:
>>>   3 region servers
>>>  60 maps
>>> each map inserts to doc table.(checkAndPut)
>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>> a index table.
>>>
>>>
>>> Sincerely,
>>> Prakash
>>>
>>
>
> Michael Segel  | (m) 312.755.9623
>
> Segel and Associates
>
>


Re: coprocessor enabled put very slow, help please~~~

2013-02-18 Thread yonghu
Forgot to say: I also tested MapReduce. It is faster than the coprocessor.

On Mon, Feb 18, 2013 at 10:01 AM, yonghu  wrote:
> Parkash,
>
> I have a six nodes cluster and met the same problem as you had. In my
> test, inserting one tuple using coprocessor is nearly 10 times slower
> than normal put operation. I think the main reason is what Lars
> pointed out, the main overhead is executing RPC.
>
> regards!
>
> Yong
>
> On Mon, Feb 18, 2013 at 6:52 AM, Wei Tan  wrote:
>> Is your CheckAndPut involving a local or remote READ? Due to the nature of
>> LSM, read is much slower compared to a write...
>>
>>
>> Best Regards,
>> Wei
>>
>>
>>
>>
>> From:   Prakash Kadel 
>> To: "user@hbase.apache.org" ,
>> Date:   02/17/2013 07:49 PM
>> Subject:coprocessor enabled put very slow, help please~~~
>>
>>
>>
>> hi,
>>i am trying to insert few million documents to hbase with mapreduce. To
>> enable quick search of docs i want to have some indexes, so i tried to use
>> the coprocessors, but they are slowing down my inserts. Arent the
>> coprocessors not supposed to increase the latency?
>> my settings:
>> 3 region servers
>>60 maps
>> each map inserts to doc table.(checkAndPut)
>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>> a index table.
>>
>>
>> Sincerely,
>> Prakash
>>


Re: coprocessor enabled put very slow, help please~~~

2013-02-18 Thread yonghu
Prakash,

I have a six-node cluster and ran into the same problem you did. In my
test, inserting one tuple using the coprocessor is nearly 10 times slower
than a normal put operation. I think the main reason is what Lars
pointed out: the main overhead is the extra RPC.

regards!

Yong

On Mon, Feb 18, 2013 at 6:52 AM, Wei Tan  wrote:
> Is your CheckAndPut involving a local or remote READ? Due to the nature of
> LSM, read is much slower compared to a write...
>
>
> Best Regards,
> Wei
>
>
>
>
> From:   Prakash Kadel 
> To: "user@hbase.apache.org" ,
> Date:   02/17/2013 07:49 PM
> Subject:coprocessor enabled put very slow, help please~~~
>
>
>
> hi,
>i am trying to insert few million documents to hbase with mapreduce. To
> enable quick search of docs i want to have some indexes, so i tried to use
> the coprocessors, but they are slowing down my inserts. Arent the
> coprocessors not supposed to increase the latency?
> my settings:
> 3 region servers
>60 maps
> each map inserts to doc table.(checkAndPut)
> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
> a index table.
>
>
> Sincerely,
> Prakash
>


Re: Is it possible to indicate the column scan order when scanning table?

2013-02-07 Thread yonghu
Thanks for your response. I will take a look.

yong
On Thu, Feb 7, 2013 at 10:11 PM, Ted Yu  wrote:
> Yonghu:
> You may want to take a look at HBASE-5416: Improve performance of scans
> with some kind of filters.
> It would be in the upcoming 0.94.5 release.
>
> You can designate an essential column family. Based on the result from this
> column family, extra column family can be scanned.
>
> Cheers
>
> On Thu, Feb 7, 2013 at 1:07 PM, Sergey Shelukhin 
> wrote:
>
>> CFs are scanned in parallel in HBASE, and each row is built; scanning
>> entire CF and then building rows by scanning entire different CF wouldn't
>> scale very well.
>> Do you filter data on ttl column family?
>>
>> On Thu, Feb 7, 2013 at 12:01 PM, yonghu  wrote:
>>
>> > Like a table can contain ttl data and static data without indicating
>> > ttl. So, I want to first scan the columns which have ttl restrictions
>> > and later the static columns. The goal that I want to achieve is to
>> > reduce the data missing due to ttl expiration during the scan.
>> >
>> > regards!
>> >
>> > Yong
>> >
>> > On Thu, Feb 7, 2013 at 6:29 PM, Ted Yu  wrote:
>> > > Can you give us the use case where the scanning order is significant ?
>> > >
>> > > Thanks
>> > >
>> > > On Thu, Feb 7, 2013 at 9:23 AM, yonghu  wrote:
>> > >
>> > >> Dear all,
>> > >>
>> > >> I wonder if it is possible to indicate the column scan order when
>> > >> scanning table. For example, if I have two column families cf1 and cf2
>> > >> and I create a scan object. Is the table scanning order of
>> > >> scan.addFamily(cf1) and   scan.addFamily(cf2) is as same as
>> > >> scan.addFamily(cf2) and scan.addFamily(cf1)? If it's the same order,
>> > >> is it possible to indicate the scanning order of table?
>> > >>
>> > >> regards!
>> > >>
>> > >> Yong
>> > >>
>> >
>>


Re: Is it possible to indicate the column scan order when scanning table?

2013-02-07 Thread yonghu
For example, a table can contain TTL data as well as static data with no
TTL. So I want to scan the columns that have TTL restrictions first, and
the static columns later. The goal I want to achieve is to reduce the
data missed due to TTL expiration during the scan.

regards!

Yong

On Thu, Feb 7, 2013 at 6:29 PM, Ted Yu  wrote:
> Can you give us the use case where the scanning order is significant ?
>
> Thanks
>
> On Thu, Feb 7, 2013 at 9:23 AM, yonghu  wrote:
>
>> Dear all,
>>
>> I wonder if it is possible to indicate the column scan order when
>> scanning table. For example, if I have two column families cf1 and cf2
>> and I create a scan object. Is the table scanning order of
>> scan.addFamily(cf1) and   scan.addFamily(cf2) is as same as
>> scan.addFamily(cf2) and scan.addFamily(cf1)? If it's the same order,
>> is it possible to indicate the scanning order of table?
>>
>> regards!
>>
>> Yong
>>


Re: Json+hbase

2013-02-04 Thread yonghu
I think you can treat the id as the row key, and the address type (home
or office) as the column family. Each address field can be treated as a
column. The real question is how you transform your JSON metadata into
HBase schema information.
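
A minimal sketch of what I mean, using the old HTable/Put client API
(the table name, family and qualifier names below are just examples):

// assumed imports: org.apache.hadoop.conf.Configuration,
// org.apache.hadoop.hbase.HBaseConfiguration,
// org.apache.hadoop.hbase.client.HTable, org.apache.hadoop.hbase.client.Put,
// org.apache.hadoop.hbase.util.Bytes
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "addresses");          // hypothetical table name
Put put = new Put(Bytes.toBytes("1"));                 // row key = the id from the JSON
put.add(Bytes.toBytes("home"), Bytes.toBytes("Address1"), Bytes.toBytes("x"));
put.add(Bytes.toBytes("home"), Bytes.toBytes("Address2"), Bytes.toBytes("x"));
put.add(Bytes.toBytes("office"), Bytes.toBytes("Address1"), Bytes.toBytes("y"));
put.add(Bytes.toBytes("office"), Bytes.toBytes("Address2"), Bytes.toBytes("y"));
table.put(put);
table.close();

Parsing the JSON itself (with whatever JSON library you use) then only
has to produce these family/qualifier/value triples.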

regards!

Yong

On Mon, Feb 4, 2013 at 11:28 AM,   wrote:
> Hi,
>
>
> Need to create the data as given below in JSON Object and put the
> data into Hbase table using java
>
> ID   Address Type   Address1   Address2   Address3
>
> 1    home           x          x          x
>      office         y          y          y
>
>
> How to put Json object in hbase table. Please guide me
>
> Thanks in advance.
>


Re: Is there a way to close automatically log deletion in HBase?

2013-02-02 Thread yonghu
Thanks for the useful information. By the way, can you give me an
example of how to define a customized log cleaner class?
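
Just to check my understanding, is it roughly something like this? (I am
assuming here that 0.94 exposes BaseLogCleanerDelegate with an
isLogDeletable(Path) hook and the hbase.master.logcleaner.plugins
property; please correct me if the class or method names differ, or if
there are other methods from the base class that must be implemented.)

// assumed imports: org.apache.hadoop.fs.Path,
// org.apache.hadoop.hbase.master.cleaner.BaseLogCleanerDelegate
public class BackupLogCleaner extends BaseLogCleanerDelegate {  // hypothetical class name
    public boolean isLogDeletable(Path filePath) {
        // a real implementation would copy/back up the old WAL file here;
        // returning false keeps the file from being deleted
        return false;
    }
}

and then register it in hbase-site.xml, e.g.:

<property>
  <name>hbase.master.logcleaner.plugins</name>
  <value>my.pkg.BackupLogCleaner</value>  <!-- plus the default TimeToLiveLogCleaner if TTL-based cleanup should stay on -->
</property>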

regards!

Yong

On Sat, Feb 2, 2013 at 1:54 PM, ramkrishna vasudevan
 wrote:
> Ya i think that is the property that controls the archiving of the logs.
>  So the other properties are wrt Log Rolling and no of logs that is needed
> before rolling happens.
> But these parameters determine the time taken for recovery in case of RS
> failure.
>
> Your purpose is to maintain the logs so that they dont get deleted?
>
> IF you are using 0.94 version then you have a way to plugin your own
> logcleaner class.
> 'BaseLogCleanerDelegate' is the default thing available.  Customise your
> logcleaner as per your requirement so that you can have a back up of the
> logs.
>
> Hope this helps.
>
> Regards
> Ram
>
>
> On Sat, Feb 2, 2013 at 3:39 PM, yonghu  wrote:
>
>> Dear Ram,
>>
>> Thanks for your response.
>>
>> Yes Log means WAL logs. Now, I can define a postWALWrite trigger to
>> write the WAL entries into the specific file as I want, however, it's
>> too slow.  The problem is that hbase automatically delete log entries.
>> However, I found some properties such as hbase.master.logcleaner.ttl
>> which I can use to indicate how long logs can exist. I think it can
>> control the life cycle of the logs. If you know other opions, please
>> let me know.
>>
>> Thanks!
>>
>> Yong
>>
>> On Sat, Feb 2, 2013 at 10:30 AM, ramkrishna vasudevan
>>  wrote:
>> > Logs of hbase you mean the normal logging or the WAL logs.  Sorry am not
>> > getting your question here.
>> > WAL trigger is for the WAL logs.
>> >
>> > Regards
>> > Ram
>> >
>> > On Sat, Feb 2, 2013 at 1:31 PM, yonghu  wrote:
>> >
>> >> Hello,
>> >>
>> >> For some reasons, I need to analyze the log of hbase. However, the log
>> >> will be automatically deleted by GC of JVM (if I understand right). I
>> >> wonder if there is a way to close automatically log deletion in HBase.
>> >> By the way, I can collect Log by WAL trigger, but it's really slow.
>> >>
>> >> regards!
>> >>
>> >> Yong
>> >>
>>


Re: Is there a way to close automatically log deletion in HBase?

2013-02-02 Thread yonghu
Dear Ram,

Thanks for your response.

Yes, by log I mean the WAL logs. I can define a postWALWrite trigger to
write the WAL entries into a specific file of my choosing, however it's
too slow. The problem is that HBase automatically deletes old log files.
However, I found properties such as hbase.master.logcleaner.ttl which I
can use to indicate how long logs are kept, so I think that can control
the life cycle of the logs. If you know of other options, please
let me know.
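
For example, something like this in hbase-site.xml (the value is in
milliseconds and is only an illustration; it controls how long files
stay in .oldlogs before the cleaner removes them):

<property>
  <name>hbase.master.logcleaner.ttl</name>
  <value>604800000</value>  <!-- example: keep old WALs for 7 days -->
</property>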

Thanks!

Yong

On Sat, Feb 2, 2013 at 10:30 AM, ramkrishna vasudevan
 wrote:
> Logs of hbase you mean the normal logging or the WAL logs.  Sorry am not
> getting your question here.
> WAL trigger is for the WAL logs.
>
> Regards
> Ram
>
> On Sat, Feb 2, 2013 at 1:31 PM, yonghu  wrote:
>
>> Hello,
>>
>> For some reasons, I need to analyze the log of hbase. However, the log
>> will be automatically deleted by GC of JVM (if I understand right). I
>> wonder if there is a way to close automatically log deletion in HBase.
>> By the way, I can collect Log by WAL trigger, but it's really slow.
>>
>> regards!
>>
>> Yong
>>


How can I set column information when I use YCSB to test HBase?

2013-01-18 Thread yonghu
Dear all,

I read the information of
https://github.com/brianfrankcooper/YCSB/wiki/Running-a-Workload

For example, I can indicate the column family name when I issue the
command line as
java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -load -db
com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p
columnfamily=family -p recordcount=1000 -s > load.dat

My question is: where can I set column information, such as the column name?

thanks!

Yong


Re: Hbase Question

2012-12-28 Thread yonghu
I think you should take a look at your row-key design and distribute
your data evenly across the cluster, since you mentioned that adding
more nodes brought no performance improvement. Maybe one node is a hot
spot and the other nodes have no work to do.
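
One common way to spread the load is to salt the row key with a hash
prefix, a rough sketch (the bucket count and key are just examples):

// assumed import: org.apache.hadoop.hbase.util.Bytes
int NUM_BUCKETS = 16;                                 // arbitrary example value
String naturalKey = "patient_12345";                  // hypothetical natural key
int bucket = (naturalKey.hashCode() & 0x7fffffff) % NUM_BUCKETS;
byte[] rowKey = Bytes.toBytes(String.format("%02d-%s", bucket, naturalKey));
// writes now spread over NUM_BUCKETS key ranges; readers must either
// compute the same prefix or query all buckets when scanning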

regards!

Yong

On Tue, Dec 25, 2012 at 3:31 AM, 周梦想  wrote:
> Hi Dalia,
>
> I think you can make a small sample of the table to do the test, then
> you'll find what's the difference of scan and count.
> because you can count it by human.
>
> Best regards,
> Andy
>
> 2012/12/24 Dalia Sobhy 
>
>>
>> Dear all,
>>
>> I have 50,000 row with diagnosis qualifier = "cardiac", and another 50,000
>> rows with "renal".
>>
>> When I type this in Hbase shell,
>>
>> import org.apache.hadoop.hbase.filter.CompareFilter
>> import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
>> import org.apache.hadoop.hbase.filter.SubstringComparator
>> import org.apache.hadoop.hbase.util.Bytes
>>
>> scan 'patient', { COLUMNS => "info:diagnosis", FILTER =>
>> SingleColumnValueFilter.new(Bytes.toBytes('info'),
>>  Bytes.toBytes('diagnosis'),
>>  CompareFilter::CompareOp.valueOf('EQUAL'),
>>  SubstringComparator.new('cardiac'))}
>>
>> Output = 50,000 row
>>
>> import org.apache.hadoop.hbase.filter.CompareFilter
>> import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
>> import org.apache.hadoop.hbase.filter.SubstringComparator
>> import org.apache.hadoop.hbase.util.Bytes
>>
>> count 'patient', { COLUMNS => "info:diagnosis", FILTER =>
>> SingleColumnValueFilter.new(Bytes.toBytes('info'),
>>  Bytes.toBytes('diagnosis'),
>>  CompareFilter::CompareOp.valueOf('EQUAL'),
>>  SubstringComparator.new('cardiac'))}
>> Output = 100,000 row
>>
>> Even though I tried it using Hbase Java API, Aggregation Client Instance,
>> and I enabled the Coprocessor aggregation for the table.
>> rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan)
>>
>> Also when measuring the improved performance on case of adding more nodes
>> the operation takes the same time.
>>
>> So any advice please?
>>
>> I have been throughout all this mess from a couple of weeks
>>
>> Thanks,
>>
>>
>>
>>


Re: Coprocessor slow down problem!

2012-12-02 Thread yonghu
Hi Anoop,

Can you give me more detailed information about how I can take a
look at what the RS threads are doing?
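
Do you simply mean taking thread dumps of the RegionServer process while
the puts are slow, e.g. something like this (the pid is a placeholder)?

jps | grep HRegionServer            # find the RegionServer pid
jstack <pid> > rs-threads-1.txt     # capture a stack dump while a slow put is in flight
# repeat a few times and compare which threads are busy or blocked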

Thanks!

Yong

On Sat, Dec 1, 2012 at 9:23 AM, ramkrishna vasudevan
 wrote:
> Ok...fine...Ya seeing what is happening in postPut should give an idea.
>
> Regards
> Ram
>
> On Sat, Dec 1, 2012 at 1:52 PM, Anoop John  wrote:
>
>> Ram,  This issue was for prePut()..postPut() was fine only...
>> Can you take a look that at the time of slow put what the corresponding RS
>> threads doing.
>> May be can get some clues from that.
>>
>> -Anoop-
>>
>>
>> On Fri, Nov 30, 2012 at 2:04 PM, ramkrishna vasudevan <
>> ramkrishna.s.vasude...@gmail.com> wrote:
>>
>> > Hi
>> >
>> > Pls check if this issue is similar to HBASE-5897.  It is fixed in 0.92.2
>> as
>> > i see from the fix versions.
>> >
>> > Regards
>> > Ram
>> >
>> > On Fri, Nov 30, 2012 at 1:13 PM, yonghu  wrote:
>> >
>> > > Dear all,
>> > >
>> > > I have two tables, named test and tracking. For every put in the test
>> > > table, I defined a postPut trigger to insert the same data into
>> > > tracking table. I tested about 50 thousands tuples. The strange thing
>> > > for me is for the first 20 thousands tuples, the speed is very fast (I
>> > > made 5 times tests, every time is the same). And then, after that, the
>> > > insertion to tracking table will be every slow, probably 2-3 rows/sec.
>> > > My hbase version is 0.92.0. Can anyone tell me why and how I can
>> > > increase the speed?
>> > >
>> > > regards!
>> > >
>> > > Yong
>> > >
>> >
>>


Re: Major compactions not firing

2012-11-29 Thread yonghu
How did you disable major compaction, and which HBase version are you using?


On Fri, Nov 30, 2012 at 8:51 AM, Varun Sharma  wrote:
> I see nothing like major compaction in the logs of the region server or the
> master...
>
> On Thu, Nov 29, 2012 at 11:46 PM, yonghu  wrote:
>
>> Do you check your log infor? As far as I understood, if there is a
>> major compaction, this event will be recorded in log.
>>
>> regards!
>>
>> Yong
>>
>> On Fri, Nov 30, 2012 at 8:41 AM, Varun Sharma  wrote:
>> > Hi,
>> >
>> > I turned off automatic major compactions and tried to major compact all
>> > regions via both the region server interface and doing a major_compact on
>> > the table from the shell. Nothing seems to happen. There are several
>> > regions with 5 store files and I have only 2 column families, hours
>> after I
>> > have fired the command.
>> >
>> > Is there something obvious I am missing. Is there a way to get more
>> > diagnostics for the same ?
>> >
>> > Thanks
>> > Varun
>>


Re: Major compactions not firing

2012-11-29 Thread yonghu
Did you check your log info? As far as I understand, when a major
compaction runs, the event is recorded in the log.
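
For example, something along these lines (the log path is just a guess
for your install, and the exact message text can differ between versions):

grep -i "major compaction" /var/log/hbase/hbase-*-regionserver-*.log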

regards!

Yong

On Fri, Nov 30, 2012 at 8:41 AM, Varun Sharma  wrote:
> Hi,
>
> I turned off automatic major compactions and tried to major compact all
> regions via both the region server interface and doing a major_compact on
> the table from the shell. Nothing seems to happen. There are several
> regions with 5 store files and I have only 2 column families, hours after I
> have fired the command.
>
> Is there something obvious I am missing. Is there a way to get more
> diagnostics for the same ?
>
> Thanks
> Varun


Re: Column family names and data size on disk

2012-11-28 Thread yonghu
I like Stack's illustration.

regards!

Yong

On Wed, Nov 28, 2012 at 6:56 PM, Stack  wrote:
> On Wed, Nov 28, 2012 at 6:40 AM, matan  wrote:
>
>> Why does the CF have to be in the HFile, isn't the entire HFile dedicated
>> to
>> just one CF to start with (I'm speaking at the HBase architecture level,
>> trying to figure why it is working as like it is).
>>
>
>
> The CF is repeated in each key because it was thought that one day an hfile
> could hold keyvalues from multiple CFs; e.g. if hbase implemented locality
> groups.
>
> St.Ack


Re: Can we insert into Hbase without specifying the column name?

2012-11-26 Thread yonghu
Hi Rams,

Yes, you can. See the following:

hbase(main):001:0> create 'test1','course'
0 row(s) in 1.6760 seconds

hbase(main):002:0> put 'test1','tom','course',90
0 row(s) in 0.1040 seconds

hbase(main):003:0> scan 'test1'
ROW   COLUMN+CELL
 tom  column=course:, timestamp=1353925674312, value=90
1 row(s) in 0.0440 seconds
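
The same thing from the Java API would roughly be (the null qualifier is
what shows up as the bare 'course:' column above):

// assumed imports: org.apache.hadoop.hbase.HBaseConfiguration,
// org.apache.hadoop.hbase.client.HTable, org.apache.hadoop.hbase.client.Put,
// org.apache.hadoop.hbase.util.Bytes
HTable table = new HTable(HBaseConfiguration.create(), "test1");
Put put = new Put(Bytes.toBytes("tom"));
put.add(Bytes.toBytes("course"), null, Bytes.toBytes("90"));  // family only, no qualifier
table.put(put);
table.close();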

regards!

Yong

On Mon, Nov 26, 2012 at 11:25 AM, Ramasubramanian Narayanan
 wrote:
> Hi,
>
> Is it possible to insert into Hbase without specifying the column name
> instead using the Column family name alone (assume there will not be any
> field created for that column family name)
>
> regards,
> Rams


Re: Log files occupy lot of Disk size

2012-11-23 Thread yonghu
I think you can call the setWriteToWAL(false) method on your writes to
reduce the amount of log information. But you risk losing data if a
region server goes down.
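
For example, with the old client API (the row/cf/column names are
placeholders, and 'table' is an already opened HTable):

Put put = new Put(Bytes.toBytes("row1"));
put.setWriteToWAL(false);  // skip the WAL for this edit; it is lost if the RS dies before a memstore flush
put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
table.put(put);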

regards!
yong

On Fri, Nov 23, 2012 at 7:58 AM, iwannaplay games
 wrote:
> Hi,
>
> Everytime i query hbase or hive ,there is a significant growth in my
> log files and it consumes lot of space from my hard disk(Approx 40
> gb)
> So i stop the cluster ,delete all the logs and free the space and then
> again start the cluster to start my work.
>
> Is there any other solution coz i cannot restart the cluster everyday.


Re: why reduce doesn't work by HBase?

2012-11-16 Thread yonghu
Thanks for your suggestions
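
For the record, the reduce signature that matches Bertrand's point would
look roughly like this (a sketch of the fix, not the full job; same
imports as the original program, i.e. org.apache.hadoop.io.* and
org.apache.hadoop.mapreduce.Reducer):

public static class Reduce extends Reducer<Text, Text, Text, IntWritable> {
   private IntWritable occ = new IntWritable();
   @Override
   public void reduce(Text key, Iterable<Text> values, Context context)
         throws IOException, InterruptedException {
      int count = 0;
      for (Text ignored : values) {   // count the versions collected for this cell
         count++;
      }
      occ.set(count);
      context.write(key, occ);
   }
}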

On Thu, Nov 15, 2012 at 9:56 PM, Bertrand Dechoux  wrote:
> Hi Yong,
>
> A few ways to understand a problem (with Hadoop) :
> 1) Write tests, using MRUnit for example (http://mrunit.apache.org/)
> 2) Use @Override to make sure you are indeed overriding a method
> 3) Use a break point while debugging
>
> The answer to your current problem : o.a.h.mapreduce.Reducer method has no
> Iterator parameter but it does have a Iterable parameter...
>
> Regards
>
> Bertrand
>
> PS : It is absolutely not related to HBase.
>
> On Thu, Nov 15, 2012 at 8:42 PM, yonghu  wrote:
>
>> Dear all,
>>
>> I use HBase as data source and hdfs as data sink. I wrote the program
>> which calculate the versions for each cell as follows:
>>
>> public class ReturnMore { //only Map side has no problem, has already
>> returned 3 versions
>>
>>public static class Map extends TableMapper<Text, Text>{
>>   private Text outkey = new Text();
>>   private Text outval = new Text();
>>   public void map(ImmutableBytesWritable key, Result values, Context
>> context){
>>  KeyValue[] kvs = values.raw();
>>  String row_key;
>>  String col_fam;
>>  String col;
>>  String val;
>>  String finalKey;
>>  String finalValue;
>>  for(KeyValue kv : kvs){
>> row_key = Bytes.toString(kv.getRow());
>> col_fam = Bytes.toString(kv.getFamily());
>> col = Bytes.toString(kv.getQualifier());
>> val = Bytes.toString(kv.getValue());
>> finalKey = row_key + "/" + col_fam + "/" + col;
>> finalValue = new Long(kv.getTimestamp()).toString() + "/" +
>> val;
>> outkey.set(finalKey);
>> outval.set(finalValue);
>> try {
>>context.write(outkey, outval);
>> } catch (IOException e) {
>>e.printStackTrace();
>> } catch (InterruptedException e) {
>>e.printStackTrace();
>> }
>>  }
>>   }
>>}
>>
>>public static class Reduce extends Reducer<Text, Text, Text, IntWritable>{ //the desired output is key + 3
>>   private IntWritable occ = new IntWritable();
>>   public void reduce(Text key, Iterator<Text> values, Context
>> context){
>>  int i = 0;
>>  while(values.hasNext()){
>> i++;
>>  }
>>  occ.set(i);
>>  try {
>> context.write(key, occ);
>>  } catch (IOException e) {
>> e.printStackTrace();
>>  } catch (InterruptedException e) {
>> e.printStackTrace();
>>  }
>>   }
>>}
>>public static void main(String[] args) throws Exception{
>>   Configuration conf = HBaseConfiguration.create();
>>   Job job = new Job(conf);
>>   job.setJarByClass(ReturnMore.class);
>>   Scan scan = new Scan();
>>   scan.setMaxVersions();
>>   job.setReducerClass(Reduce.class);
>>   job.setOutputKeyClass(Text.class);
>>   job.setOutputValueClass(IntWritable.class);
>>   job.setNumReduceTasks(1);
>>   TableMapReduceUtil.initTableMapperJob("test", scan, Map.class,
>> Text.class, Text.class, job);
>>   FileSystem file = FileSystem.get(conf);
>>   Path path = new Path("/hans/test/");
>>   if(file.exists(path)){
>>  file.delete(path,true);
>>   }
>>   FileOutputFormat.setOutputPath(job, path);
>>   System.exit(job.waitForCompletion(true)?0:1);
>>}
>> }
>>
>> I tested this code in both hbase 0.92.1 and 0.94. However, if I run
>> this code, it always outputs the content for each cell not as the
>> output as I defined in reduce function (key + occurrence for each
>> cell). Can anyone give me advices? By the way, I run it on
>> pseudo-mode.
>>
>> regards!
>>
>> Yong
>>
>
>
>
> --
> Bertrand Dechoux


Re: Why Regionserver is not serving when I set the WAL trigger?

2012-11-12 Thread yonghu
The problem was caused by my code: I created a new Configuration object
myself. The correct way is to use the environment's getConfiguration() method.
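
In other words, inside preWALWrite the change is roughly:

// before (broken): Configuration conf = new Configuration();
Configuration conf = ctx.getEnvironment().getConfiguration();  // reuse the region server's configuration
HTable table = new HTable(conf, "log");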

regards!

Yong

On Sat, Nov 10, 2012 at 4:12 PM, ramkrishna vasudevan
 wrote:
> Sorry i am not very sure if there is any link between the coprocessor and
> region not online.
> Pls check if your META region is online.
>
> Regards
> ram
>
> On Sat, Nov 10, 2012 at 8:37 PM, yonghu  wrote:
>
>> Dear All,
>>
>> I used hbase 0.94.1 and implemented the test example of WAL trigger like:
>>
>> public class WalTrigger extends BaseRegionObserver implements WALObserver{
>>
>> public boolean
>> preWALWrite(ObserverContext<WALCoprocessorEnvironment>
>> ctx, HRegionInfo info, HLogKey logKey, WALEdit logEdit) throws
>> IOException{
>> Configuration conf = new Configuration();
>> String key = logKey.toString();
>> String value = logEdit.toString();
>> String logRes = key + value;
>> HTable table = new HTable(conf,"log");
>> Put put = new Put(Bytes.toBytes(key));
>> put.add(Bytes.toBytes("logEntry"), null,
>> Bytes.toBytes(value));
>> table.put(put);
>> return true;
>> }
>> public void postWALWrite(ObserverContext<WALCoprocessorEnvironment>
>> ctx, HRegionInfo info, HLogKey logKey, WALEdit logEdit) throws
>> IOException{
>>
>> }
>> }
>>
>> However, when I inserted the tuples in HBase, it returns Exception in
>> thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException:
>> Failed after attempts=10, exceptions:
>> Sat Nov 10 15:58:14 CET 2012,
>> org.apache.hadoop.hbase.client.ScannerCallable@1ff92f5,
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region is not
>> online: .META.,,1
>> Sat Nov 10 15:58:15 CET 2012,
>> org.apache.hadoop.hbase.client.ScannerCallable@1ff92f5,
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region is not
>> online: .META.,,1
>> Sat Nov 10 15:58:16 CET 2012,
>> org.apache.hadoop.hbase.client.ScannerCallable@1ff92f5,
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region is not
>> online: .META.,,1
>> Sat Nov 10 15:58:17 CET 2012,
>> org.apache.hadoop.hbase.client.ScannerCallable@1ff92f5,
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region is not
>> online: .META.,,1
>> Sat Nov 10 15:58:19 CET 2012,
>> org.apache.hadoop.hbase.client.ScannerCallable@1ff92f5,
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region is not
>> online: .META.,,1
>> Sat Nov 10 15:58:21 CET 2012,
>> org.apache.hadoop.hbase.client.ScannerCallable@1ff92f5,
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region is not
>> online: .META.,,1
>> Sat Nov 10 15:58:25 CET 2012,
>> org.apache.hadoop.hbase.client.ScannerCallable@1ff92f5,
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region is not
>> online: .META.,,1
>> Sat Nov 10 15:58:29 CET 2012,
>> org.apache.hadoop.hbase.client.ScannerCallable@1ff92f5,
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region is not
>> online: .META.,,1
>> Sat Nov 10 15:58:37 CET 2012,
>> org.apache.hadoop.hbase.client.ScannerCallable@1ff92f5,
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region is not
>> online: .META.,,1
>> Sat Nov 10 15:58:54 CET 2012,
>> org.apache.hadoop.hbase.client.ScannerCallable@1ff92f5,
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region is not
>> online: .META.,,1
>>
>> at
>> org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:183)
>> at
>> org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:205)
>> at
>> org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:120)
>> at
>> org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:626)
>> at
>> org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:707)

Re: A question of storage structure for memstore?

2012-10-22 Thread yonghu
Thanks for your responses.

yong


On Mon, Oct 22, 2012 at 3:05 PM, Anoop Sam John  wrote:
> To be precise there will be one memstore per family per region..
> If table having 2 CFs and there are 10 regions for that table then totally 
> 2*10=20 memstores..
>
> -Anoop-
> 
> From: Kevin O'dell [kevin.od...@cloudera.com]
> Sent: Monday, October 22, 2012 5:55 PM
> To: user@hbase.apache.org
> Subject: Re: A question of storage structure for memstore?
>
> Yes, there will be two memstores if you have two CFs.
> On Oct 22, 2012 7:25 AM, "yonghu"  wrote:
>
>> Dear All,
>>
>> In the description it mentions that a Store file (per column family)
>> is composed of one memstore and a set of HFiles. Does it imply that
>> for every column family there is a corresponding memstore? For
>> example. if a table has 2 column families, there will be 2 memstores
>> in memory?
>>
>> regards!
>>
>> Yong
>>


Re: Can I use coprocessor to record the deleted data caused by ttl?

2012-09-01 Thread yonghu
I want to record the data changes from the source table. Thanks for your reply.

regards!

Yong

On Sat, Sep 1, 2012 at 1:01 AM, lars hofhansl  wrote:
> Yes (in 0.94.2+). But it would be quite tricky.
> You'd have to hook into the compaction. There's a new hook now in 
> RegionObserver (preCompactionScannerOpen, and preFlushScannerOpen).
> See HBASE-6427.
>
> These two hooks are passed the scanners that provide the set of KVs to be 
> compacted. You could wrap that scanner in your own.
> The details will be tricky, though.
>
> What is your usecase for this?
>
> -- Lars
>
>
>
> 
>  From: yonghu 
> To: user@hbase.apache.org
> Sent: Friday, August 31, 2012 1:13 PM
> Subject: Can I use coprocessor to record the deleted data caused by ttl?
>
> Dear All,
>
> I wonder if I can use coprocessor to record the deleted data caused by
> ttl. Any ideas?
>
> regards!
>
> Yong


Re: What happened in hlog if data are deleted caused by ttl?

2012-08-22 Thread yonghu
Sorry for that. I didn't use the right parameter. Now I get the point.

regards!

Yong

On Wed, Aug 22, 2012 at 10:49 AM, Harsh J  wrote:
> Hey Yonghu,
>
> You are right that TTL "deletions" (it isn't exactly a delete, its
> more of a compact-time skip wizardry) do not go to the HLog as
> "events". Know that TTLs aren't applied "per-cell", they are applied
> on the whole CF globally. So there is no such thing as a TTL-write or
> a TTL-delete event. In fact, the Region-level Coprocessor too has no
> hooks for "TTL-events", as seen at
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/coprocessor/RegionObserver.html,
> for this doesn't happen on triggers.
>
> What you say about the compaction part is wrong however. Compaction
> too runs a regular store-file scanner to compact, and so does the
> regular Scan operation, to read (Both use the same file scanning
> mechanism/code). So there's no difference in how compact or a client
> scan handle TTL-expired row values from a store file, when reading it
> up.
>
> I also am not able to understand what your sample shell command list
> shows. As I see it, its shown that the HFile did have the entry in it
> after you had flushed it. Note that you mentioned the TTL at the CF
> level when creating the table, not in your "put" statement, and this
> is a vital point in understanding how TTLs work.
>
> On Wed, Aug 22, 2012 at 1:49 PM, yonghu  wrote:
>> I can fully understand normal deletion. But, in my point of view, ttl
>> deletion is different than the normal deletion. The insertion of ttl
>> data is recorded in hlog. But the ttl deletion is not recorded by
>> hlog. So, it failure occurs, should the ttl data be reinserted to data
>> or can we discard the certain ttl data? Moreover, ttl deletion is not
>> executed at data compaction time. Scanner needs to periodically scan
>> each Store file to execute deletion.
>>
>> regards!
>>
>> Yong
>>
>>
>>
>> On Tue, Aug 21, 2012 at 5:29 PM, jmozah  wrote:
>>> This helped me 
>>> http://hadoop-hbase.blogspot.in/2011/12/deletion-in-hbase.html
>>>
>>>
>>> ./Zahoor
>>> HBase Musings
>>>
>>>
>>> On 14-Aug-2012, at 6:54 PM, Harsh J  wrote:
>>>
>>>> Hi Yonghu,
>>>>
>>>> A timestamp is stored along with each insert. The ttl is maintained at
>>>> the region-store level. Hence, when the log replays, all entries with
>>>> expired TTLs are automatically omitted.
>>>>
>>>> Also, TTL deletions happen during compactions, and hence do not
>>>> carry/need Delete events. When scanning a store file, TTL-expired
>>>> entries are automatically skipped away.
>>>>
>>>> On Tue, Aug 14, 2012 at 3:34 PM, yonghu  wrote:
>>>>> My hbase version is 0.92. I tried something as follows:
>>>>> 1.Created a table 'test' with 'course' in which ttl=5.
>>>>> 2. inserted one row into the table. 5 seconds later, the row was deleted.
>>>>> Later when I checked the log infor of 'test' table, I only found the
>>>>> inserted information but not deleted information.
>>>>>
>>>>> Can anyone tell me which information is written into hlog when data is
>>>>> deleted by ttl or in this situation, no information is written into
>>>>> the hlog. If there is no information of deletion in the log, how can
>>>>> we guarantee the data recovered by log are correct?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Yong
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>
>
>
>
> --
> Harsh J


Re: What happened in hlog if data are deleted caused by ttl?

2012-08-22 Thread yonghu
Another interesting point is that the TTL data will not exist in the
HFile. I made the following test:

hbase(main):003:0> create 'test',{TTL=>'200',NAME=>'course'}
0 row(s) in 1.1420 seconds

hbase(main):005:0> put 'test','tom','course:english',90
0 row(s) in 0.0320 seconds

hbase(main):006:0> flush 'test'
0 row(s) in 0.1680 seconds

hbase(main):007:0> scan 'test'
ROW   COLUMN+CELL
 tom  column=course:english, timestamp=1345623867082, value=90
1 row(s) in 0.0350 seconds

./hbase org.apache.hadoop.hbase.io.hfile.HFile -v -f
/hbase/test/abe4d5adaa650cdd46d26dca0bf85b72/course/8c77fb321f934592869f9852f777b22e
Scanning -> 
/hbase/test/abe4d5adaa650cdd46d26dca0bf85b72/course/8c77fb321f934592869f9852f777b22e
12/08/22 10:27:39 INFO hfile.CacheConfig: Allocating LruBlockCache
with maximum size 247.9m
Scanned kv count -> 1

So, I guess the TTL data is only managed in the memstore. But the
question is: what happens if the memstore does not have enough room to
accept new incoming TTL data? Can anybody explain?

Thanks!

Yong
On Wed, Aug 22, 2012 at 10:19 AM, yonghu  wrote:
> I can fully understand normal deletion. But, in my point of view, ttl
> deletion is different than the normal deletion. The insertion of ttl
> data is recorded in hlog. But the ttl deletion is not recorded by
> hlog. So, it failure occurs, should the ttl data be reinserted to data
> or can we discard the certain ttl data? Moreover, ttl deletion is not
> executed at data compaction time. Scanner needs to periodically scan
> each Store file to execute deletion.
>
> regards!
>
> Yong
>
>
>
> On Tue, Aug 21, 2012 at 5:29 PM, jmozah  wrote:
>> This helped me http://hadoop-hbase.blogspot.in/2011/12/deletion-in-hbase.html
>>
>>
>> ./Zahoor
>> HBase Musings
>>
>>
>> On 14-Aug-2012, at 6:54 PM, Harsh J  wrote:
>>
>>> Hi Yonghu,
>>>
>>> A timestamp is stored along with each insert. The ttl is maintained at
>>> the region-store level. Hence, when the log replays, all entries with
>>> expired TTLs are automatically omitted.
>>>
>>> Also, TTL deletions happen during compactions, and hence do not
>>> carry/need Delete events. When scanning a store file, TTL-expired
>>> entries are automatically skipped away.
>>>
>>> On Tue, Aug 14, 2012 at 3:34 PM, yonghu  wrote:
>>>> My hbase version is 0.92. I tried something as follows:
>>>> 1.Created a table 'test' with 'course' in which ttl=5.
>>>> 2. inserted one row into the table. 5 seconds later, the row was deleted.
>>>> Later when I checked the log infor of 'test' table, I only found the
>>>> inserted information but not deleted information.
>>>>
>>>> Can anyone tell me which information is written into hlog when data is
>>>> deleted by ttl or in this situation, no information is written into
>>>> the hlog. If there is no information of deletion in the log, how can
>>>> we guarantee the data recovered by log are correct?
>>>>
>>>> Thanks!
>>>>
>>>> Yong
>>>
>>>
>>>
>>> --
>>> Harsh J
>>


Re: What happened in hlog if data are deleted caused by ttl?

2012-08-22 Thread yonghu
I can fully understand normal deletion. But, in my view, TTL deletion
is different from normal deletion. The insertion of TTL data is
recorded in the HLog, but the TTL deletion is not. So, if a failure
occurs, should the TTL data be re-inserted during recovery, or can we
discard that TTL data? Moreover, TTL deletion is not executed at
compaction time; a scanner needs to periodically scan each store file
to execute the deletion.

regards!

Yong



On Tue, Aug 21, 2012 at 5:29 PM, jmozah  wrote:
> This helped me http://hadoop-hbase.blogspot.in/2011/12/deletion-in-hbase.html
>
>
> ./Zahoor
> HBase Musings
>
>
> On 14-Aug-2012, at 6:54 PM, Harsh J  wrote:
>
>> Hi Yonghu,
>>
>> A timestamp is stored along with each insert. The ttl is maintained at
>> the region-store level. Hence, when the log replays, all entries with
>> expired TTLs are automatically omitted.
>>
>> Also, TTL deletions happen during compactions, and hence do not
>> carry/need Delete events. When scanning a store file, TTL-expired
>> entries are automatically skipped away.
>>
>> On Tue, Aug 14, 2012 at 3:34 PM, yonghu  wrote:
>>> My hbase version is 0.92. I tried something as follows:
>>> 1.Created a table 'test' with 'course' in which ttl=5.
>>> 2. inserted one row into the table. 5 seconds later, the row was deleted.
>>> Later when I checked the log infor of 'test' table, I only found the
>>> inserted information but not deleted information.
>>>
>>> Can anyone tell me which information is written into hlog when data is
>>> deleted by ttl or in this situation, no information is written into
>>> the hlog. If there is no information of deletion in the log, how can
>>> we guarantee the data recovered by log are correct?
>>>
>>> Thanks!
>>>
>>> Yong
>>
>>
>>
>> --
>> Harsh J
>


Re: What happened in hlog if data are deleted caused by ttl?

2012-08-21 Thread yonghu
Thanks for your response. Can you tell me how data is deleted due to
the TTL? Which module in HBase triggers the deletion? You mentioned
the scanner; does that mean the scanner scans the store file
periodically and then deletes the data which has expired?

regards!

Yong

On Thu, Aug 16, 2012 at 6:16 AM, Ramkrishna.S.Vasudevan
 wrote:
> Hi
>
> Just to add on,  The HLog is just an edit log.  Any transaction updates(
> Puts/Deletes) are just added to HLog.  It is the Scanner that takes care of
> the TTL part which is calculated from the TTL configured at the column
> family(Store) level.
>
> Regards
> Ram
>
>> -Original Message-
>> From: Harsh J [mailto:ha...@cloudera.com]
>> Sent: Tuesday, August 14, 2012 8:51 PM
>> To: user@hbase.apache.org
>> Subject: Re: What happened in hlog if data are deleted cuased by ttl?
>>
>> Yes, TTL deletions are done only during compactions. They aren't
>> "Deleted" in the sense of what a Delete insert signifies, but are
>> rather eliminated in the write process when new
>> storefiles are written out - if the value being written to the
>> compacted store has already expired.
>>
>> On Tue, Aug 14, 2012 at 8:40 PM, yonghu  wrote:
>> > Hi Hars,
>> >
>> > Thanks for your reply. If I understand you right, it means the ttl
>> > deletion will not reflect in log.
>> >
>> > On Tue, Aug 14, 2012 at 3:24 PM, Harsh J  wrote:
>> >> Hi Yonghu,
>> >>
>> >> A timestamp is stored along with each insert. The ttl is maintained
>> at
>> >> the region-store level. Hence, when the log replays, all entries
>> with
>> >> expired TTLs are automatically omitted.
>> >>
>> >> Also, TTL deletions happen during compactions, and hence do not
>> >> carry/need Delete events. When scanning a store file, TTL-expired
>> >> entries are automatically skipped away.
>> >>
>> >> On Tue, Aug 14, 2012 at 3:34 PM, yonghu 
>> wrote:
>> >>> My hbase version is 0.92. I tried something as follows:
>> >>> 1.Created a table 'test' with 'course' in which ttl=5.
>> >>> 2. inserted one row into the table. 5 seconds later, the row was
>> deleted.
>> >>> Later when I checked the log infor of 'test' table, I only found
>> the
>> >>> inserted information but not deleted information.
>> >>>
>> >>> Can anyone tell me which information is written into hlog when data
>> is
>> >>> deleted by ttl or in this situation, no information is written into
>> >>> the hlog. If there is no information of deletion in the log, how
>> can
>> >>> we guarantee the data recovered by log are correct?
>> >>>
>> >>> Thanks!
>> >>>
>> >>> Yong
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>>
>>
>>
>> --
>> Harsh J
>


Re: Put w/ timestamp -> Deleteall -> Put w/ timestamp fails

2012-08-15 Thread yonghu
Hi Harsh,

I have a question about your description. The delete marker masks the
newly inserted value with the old timestamp; that's why the newly
inserted data can't be seen. But after a major compaction, this new
value will be seen again. So the question is how the deletion really
executes. In my understanding, the deletion removes all data values
whose timestamps are less than or equal to the TS of the delete marker.
So, if you insert a value with an old TS after you insert a delete
marker, it should also be deleted at compaction time. For example, if I
first insert (k1,t1), then delete it with a delete marker whose TS is
greater than t1, and then re-insert (k1,t1) again, then at compaction
time both (k1,t1) versions should be deleted.

Looking forward to your response!

Yong



On Wed, Aug 15, 2012 at 7:53 AM, Takahiko Kawasaki  wrote:
> Dear Harsh,
>
> Thank you very much for your detailed explanation. I could understand
> what had been going on during my put/scan/delete operations. I'll modify
> my application and test programs taking the timestamp implementation
> into consideration.
>
> Best Regards,
> Takahiko Kawasaki
>
> 2012/8/15 Harsh J 
>
>> When a Delete occurs, an insert is made with the timestamp being the
>> current time (to indicate it is the latest version). Hence, when you
>> insert a value after this with an _older_ timestamp, it is not taken
>> in as the latest version, and is hence ignored when scanning. This is
>> why you do not see the data.
>>
>> If you instead insert this after a compaction has fully run on this
>> store file, then your value will indeed get shown after insert, cause
>> at that moment there wouldn't exist such a row with a latest timestamp
>> at all.
>>
>> hbase(main):060:0> flush 'test-table'
>> 0 row(s) in 0.1020 seconds
>>
>> hbase(main):061:0> major_compact 'test-table'
>> 0 row(s) in 0.0400 seconds
>>
>> hbase(main):062:0> put 'test-table', 'row4', 'test-family', 'value', 10
>> 0 row(s) in 0.0230 seconds
>>
>> hbase(main):063:0> scan 'test-table'
>> ROW   COLUMN+CELL
>>  row4 column=test-family:, timestamp=10, value=value
>> 1 row(s) in 0.0060 seconds
>>
>> I suppose this is why it is recommended not to mess with the
>> timestamps manually, and instead just rely on versions.
>>
>> On Tue, Aug 14, 2012 at 8:24 PM, Takahiko Kawasaki 
>> wrote:
>> > Hello,
>> >
>> > I have a problem where 'put' with timestamp does not succeed.
>> > I did the following at the HBase shell.
>> >
>> > (1) Do 'put' with timestamp.
>> >   # 'scan' shows 1 row.
>> >
>> > (2) Delete the row by 'deleteall'.
>> >   # 'scan' says "0 row(s)".
>> >
>> > (3) Do 'put' again by the same command line as (1).
>> >   # 'scan' says "0 row(s)" ! Why?
>> >
>> > (4) Increment the timestamp value by 1 and try 'put' again.
>> >   # 'scan' still says "0 row(s)"! Why?
>> >
>> > The command lines I actually typed are as follows and the attached
>> > file is the output from the command lines.
>> >
>> > scan 'test-table'
>> > put 'test-table', 'row3', 'test-family', 'value'
>> > scan 'test-table'
>> > deleteall 'test-table', 'row3'
>> > scan 'test-table'
>> > put 'test-table', 'row3', 'test-family', 'value'
>> > scan 'test-table'
>> > deleteall 'test-table', 'row3'
>> > scan 'test-table'
>> > put 'test-table', 'row4', 'test-family', 'value', 10
>> > scan 'test-table'
>> > deleteall 'test-table', 'row4'
>> > scan 'test-table'
>> > put 'test-table', 'row4', 'test-family', 'value', 10
>> > scan 'test-table'
>> > put 'test-table', 'row4', 'test-family', 'value', 10
>> > scan 'test-table'
>> > quit
>> >
>> > Is this behavior the HBase specification?
>> >
>> > My cluster is built using CDH4 and the HBase version is 0.92.1-cdh4.0.0.
>> >
>> > Could anyone give me any insight, please?
>> >
>> > Best Regards,
>> > Takahiko Kawasaki
>>
>>
>>
>> --
>> Harsh J
>>


Re: What happened in hlog if data are deleted caused by ttl?

2012-08-14 Thread yonghu
Hi Harsh,

Thanks for your reply. If I understand you correctly, it means the TTL
deletion is not reflected in the log.

On Tue, Aug 14, 2012 at 3:24 PM, Harsh J  wrote:
> Hi Yonghu,
>
> A timestamp is stored along with each insert. The ttl is maintained at
> the region-store level. Hence, when the log replays, all entries with
> expired TTLs are automatically omitted.
>
> Also, TTL deletions happen during compactions, and hence do not
> carry/need Delete events. When scanning a store file, TTL-expired
> entries are automatically skipped away.
>
> On Tue, Aug 14, 2012 at 3:34 PM, yonghu  wrote:
>> My hbase version is 0.92. I tried something as follows:
>> 1.Created a table 'test' with 'course' in which ttl=5.
>> 2. inserted one row into the table. 5 seconds later, the row was deleted.
>> Later when I checked the log infor of 'test' table, I only found the
>> inserted information but not deleted information.
>>
>> Can anyone tell me which information is written into hlog when data is
>> deleted by ttl or in this situation, no information is written into
>> the hlog. If there is no information of deletion in the log, how can
>> we guarantee the data recovered by log are correct?
>>
>> Thanks!
>>
>> Yong
>
>
>
> --
> Harsh J


What happened in hlog if data are deleted caused by ttl?

2012-08-14 Thread yonghu
My HBase version is 0.92. I tried the following:
1. Created a table 'test' with a column family 'course' in which TTL=5.
2. Inserted one row into the table. 5 seconds later, the row was deleted.
Later, when I checked the log info of the 'test' table, I only found
the insert information, not the delete information.

Can anyone tell me which information is written into the HLog when data
is deleted by TTL, or whether in this situation no information is
written into the HLog at all? If there is no deletion information in
the log, how can we guarantee that the data recovered from the log is
correct?

Thanks!

Yong


Re: is there anyway to turn off compaction in hbase

2012-08-13 Thread yonghu
Harsh is right. You edited the wrong place.

regards!

Yong

On Sun, Aug 12, 2012 at 1:40 PM, Harsh J  wrote:
> Richard,
>
> The property disables major compactions from happening automatically.
> However, if you choose to do this, you should ensure you have a cron
> job that does trigger major_compact on all tables - for compaction is
> a necessary thing, but you just do not want it to happen at any time
> it likes to.
>
> Also, all properties to be overriden must go to
> $HBASE_HOME/conf/hbase-site.xml, and you do not need to edit source
> files and recompile for every config change you want to do.
>
> On Thu, Aug 9, 2012 at 5:37 PM, Richard Tang  
> wrote:
>> That is right. I have tried following properties, in conf/hbase-site.xml.
>> It looks it works to disable both major and minor (in some sense)
>> compactions.
>> Note these properties are defined in src/main/resources/hbase-default.xml,
>> but putting the property there without recompiling the src, it will not
>> work.
>>
>> <property>
>>   <name>hbase.hstore.compactionThreshold</name>
>>   <value>100</value>
>>   <description>
>>     If more than this number of HStoreFiles in any one HStore
>>     (one HStoreFile is written per flush of memstore) then a compaction
>>     is run to rewrite all HStoreFiles files as one.  Larger numbers
>>     put off compaction but when it runs, it takes longer to complete.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>hbase.hstore.blockingStoreFiles</name>
>>   <value>7</value>
>>   <description>
>>     If more than this number of StoreFiles in any one Store
>>     (one StoreFile is written per flush of MemStore) then updates are
>>     blocked for this HRegion until a compaction is completed, or
>>     until hbase.hstore.blockingWaitTime has been exceeded.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>hbase.hstore.blockingWaitTime</name>
>>   <value>9</value>
>>   <description>
>>     The time an HRegion will block updates for after hitting the StoreFile
>>     limit defined by hbase.hstore.blockingStoreFiles.
>>     After this time has elapsed, the HRegion will stop blocking updates even
>>     if a compaction has not been completed.  Default: 90 seconds.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>hbase.hstore.compaction.max</name>
>>   <value>100</value>
>>   <description>Max number of HStoreFiles to compact per 'minor' compaction.</description>
>> </property>
>>
>> <property>
>>   <name>hbase.hregion.majorcompaction</name>
>>   <value>0</value>
>>   <description>
>>     The time (in miliseconds) between 'major' compactions of all
>>     HStoreFiles in a region.  Default: 1 day.
>>     Set to 0 to disable automated major compactions.
>>   </description>
>> </property>
>>
>> On Thu, Aug 9, 2012 at 7:03 AM, henry.kim  wrote:
>>
>>> yes, there is a property in hbase-site.xml
>>>
>>> check the property which is named 'hbase.hregion.majorcompaction'
>>>
>>> if you set it to value 0, there will be no automatic compactions.
>>>
>>> On Aug 9, 2012, at 7:43 PM, Richard Tang  wrote:
>>>
>>> > I am using hbase 0.92.1 and want to disable compaction (both major and
>>> > minor compactions) for a period of time. is that configurable in hbase?
>>>
>>>
>
>
>
> --
> Harsh J


Re: column based or row based storage for HBase?

2012-08-05 Thread yonghu
In my understanding of the column-oriented structure of HBase, the
first thing is the term column-oriented itself. It means that data
belonging to the same column family is stored contiguously on disk.
Within each column family, the data is stored row by row. If you want
to understand the internal mechanism of HBase, you had better take a
look at the content of an HFile.
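
For example, dumping a store file shows that every KeyValue repeats the
row key, column family and qualifier (the path below is a placeholder;
-p prints the key/values):

./hbase org.apache.hadoop.hbase.io.hfile.HFile -v -p -f /hbase/<table>/<region>/<cf>/<storefile>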

regards!

Yong

On Mon, Aug 6, 2012 at 5:03 AM, Lin Ma  wrote:
> Thank you for the informative reply, Mohit!
>
> Some more comments,
>
> 1. actually my confusion about column based storage is from the book "HBase
> The Definitive Guide", chapter 1, section "the Dawn of Big Data", which
> draw a picture showing HBase store the same column of all different rows
> continuously physically in storage. Any comments?
>
> 2. I want to confirm my understanding is correct -- supposing I have only
> one column family with 10 columns, the physical storage is row (with all
> related columns) after row, other than store 1st column of all rows, then
> store 2nd columns of all rows, etc?
>
> 3. It seems when we say column based storage, there are two meanings, (1)
> column oriented database => en.wikipedia.org/wiki/Column-oriented_DBMS,
> where the same column of different rows stored together, (2) and column
> oriented architecture, e.g. how Hbase is designed, which is used to
> describe the pattern to store sparse, large number of columns (with NULL
> for free). Any comments?
>
> regards,
> Lin
>
> On Mon, Aug 6, 2012 at 12:08 AM, Mohit Anchlia wrote:
>
>> On Sun, Aug 5, 2012 at 6:04 AM, Lin Ma  wrote:
>>
>> > Hi guys,
>> >
>> > I am wondering whether HBase is using column based storage or row based
>> > storage?
>> >
>> >- I read some technical documents and mentioned advantages of HBase is
>> >using column based storage to store similar data together to foster
>> >compression. So it means same columns of different rows are stored
>> > together;
>>
>>
>> Probably what you read was in context of Column Families. HBase has concept
>> of column family similar to Google's bigtable. And the store files on disk
>> is per column family. All columns of a given column family are in one store
>> file and columns of different column family is a different file.
>>
>>
>> >- But I also learned HBase is a sorted key-value map in underlying
>> >HFile. It uses key to address all related columns for that key (row),
>> > so it
>> >seems to be a row based storage?
>> >
>> HBase stores entire row together along with columns represented by
>> KeyValue. This is also called cell in HBase.
>>
>>
>> > It is appreciated if anyone could clarify my confusions. Any related
>> > documents or code for more details are welcome.
>> >
>> > thanks in advance,
>> >
>> > Lin
>> >
>>


Re: Why Hadoop can't find Reducer when Mapper reads data from HBase?

2012-07-12 Thread yonghu
The strange thing is that the same program works fine on the cluster.
By the way, the same error also happened in pseudo mode when MapReduce
read data from Cassandra in the Map phase and transferred it to the
Reduce phase.

regards!

Yong

On Thu, Jul 12, 2012 at 2:01 PM, Stack  wrote:
> On Thu, Jul 12, 2012 at 1:15 PM, yonghu  wrote:
>> java.lang.RuntimeException: java.lang.ClassNotFoundException:
>> com.mapreducetablescan.MRTableAccess$MTableReducer;
>>
>> Does anybody know why?
>>
>
> Its not in your job jar?  Check the job jar (jar -tf JAR_FILE).
>
> St.Ack


Re: HBase RegionServer can't connect to Master

2012-05-04 Thread yonghu
I think you can also use the ifconfig command in the VM to see the IP
address, and then fix the address mapping in /etc/hosts.
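
For example, an /etc/hosts along these lines on every node (names and
addresses are placeholders). Also make sure the master's own hostname
does not resolve to 127.0.0.1/127.0.1.1 on the master box; that is
usually why "localhost" ends up being registered in ZooKeeper:

192.168.1.10   hbase-master
192.168.1.11   hbase-rs1
192.168.1.12   hbase-rs2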

Regards!

Yong

On Wed, May 2, 2012 at 7:21 PM, Ben Lisbakken  wrote:
> Hello --
>
> I've got a problem where the RegionServers try to connect to localhost for
> the Master, because that's what's being reported to them by the ZooKeeper.
>  Since they are not on the same machine, the requests fail:
> 2012-05-01 18:01:27,111 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to
> Master server at localhost,6
> 2012-05-01 18:02:27,276 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to
> master. Retrying. Error was:
> java.net.ConnectException: Connection refused
>
> I've looked through the documentation, it seems there is no way to
> statically set the master's address.  I've read through this archive a bit
> and found this thread:
> http://mail-archives.apache.org/mod_mbox/hbase-user/201112.mbox/%3c28a2f04e.aa30.13440f376a4.coremail.exception...@163.com%3E
>
> The suggestions are basically to modify /etc/hosts.
>
> In my situation, I don't think I can use this approach.  I am doing an
> automated deployment on freshly deployed VMs and having a script modify
> /etc/hosts doesn't fit in well with this.
>
> There has to be a better way to force the master to report the correct
> address.
>
> Can anyone help?
>
> Thanks,
> Ben


Re: Hbase custom filter

2012-05-02 Thread yonghu
It means that the Java runtime can't find the
org/apache/hadoop/hbase/filter/FilterBase class. You have to add the
HBase jar to your classpath.
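
Also, the "cannot execute binary file" message suggests the separator:
in hbase-env.sh the classpath entries must be joined with a colon, not a
semicolon (a semicolon ends the shell command and then tries to execute
the jar), e.g.:

export HBASE_CLASSPATH=/cldo/hadoop/conf:/cldo/customfilter.jar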

regards!

Yong

On Wed, May 2, 2012 at 12:12 PM, cldo  wrote:
>
> i want to custom filter hbase.
> i created jar file by eclipse, copy to sever and in file hbase-env.xml i set
> "export HBASE_CLASSPATH=/cldo/hadoop/conf;/cldo/customfilter.jar
>
> but when start have error
>
> /cldo/hbase/bin/../conf/hbase-env.sh: line 29: /cldo/customfilter.jar:
> cannot execute binary file
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/hadoop/hbase/filter/FilterBase
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
> at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>
> thank
> --
> View this message in context: 
> http://old.nabble.com/Hbase-custom-filter-tp33763367p33763367.html
> Sent from the HBase User mailing list archive at Nabble.com.
>


Re: Are minor compaction and major compaction different in HBase 0.92?

2012-04-27 Thread yonghu
Thanks for your reply. I installed HBase in pseudo-mode. I think what
happened in my case is that the minor compaction was promoted to a
major compaction. I had 3 HFiles for table 'test' on one RegionServer.
After I issued the compact
(org.apache.hadoop.hbase.client.HBaseAdmin.compact()) method, those 3
HFiles were compacted into 1 HFile and the delete markers were
discarded. That's why I felt there was no difference between minor and
major compaction.

regards!

Yong

On Fri, Apr 27, 2012 at 12:41 AM, lars hofhansl  wrote:
> The main practical difference is that only a major compaction cleans out 
> delete markers.
> Delete markers cannot be removed during a minor compaction since an affected 
> KeyValue could exist in an HFile that is not part of this compaction.
>
> -- Lars
> ________
>
> From: yonghu 
> To: user@hbase.apache.org
> Sent: Wednesday, April 25, 2012 11:48 PM
> Subject: Are minor compaction and major compaction different in HBase 0.92?
>
> Hello,
>
> My HBase version is 0.92.0. And I find that when I use minor
> compaction and major compaction to compact a table, there are no
> differences. In the minor compaction, it will remove the deleted cells
> and discard the exceeding data versions which should be the task of
> major compaction. I wonder if these two compaction can do the same
> work, why do we need both?  I mean the minor compaction has already
> done the job for major compaction, why we still need major compaction?
>
> Regards!
>
> Yong


Are minor compaction and major compaction different in HBase 0.92?

2012-04-25 Thread yonghu
Hello,

My HBase version is 0.92.0, and I find that when I use minor
compaction and major compaction to compact a table, there is no
difference. The minor compaction removes the deleted cells and
discards the excess data versions, which should be the task of the
major compaction. If these two compactions do the same work, why do we
need both? I mean, if the minor compaction has already done the job of
the major compaction, why do we still need major compaction?

Regards!

Yong


Re: Problem to Insert the row that i was deleted

2012-04-25 Thread yonghu
As Lars mentioned, the row is not physically deleted. What HBase does
is insert a cell called a "tombstone" which masks the deleted value,
but the value is still there. (If the deleted value is in the same
memstore as the tombstone, it is removed in the memstore, so you will
not find the tombstone and the deleted value in the same HFile.) This
is new in HBase 0.92.0; in the earlier 0.90.* releases, both the
tombstone and the deleted value end up in the HFile. If you want to
read your deleted data, you can read the HFile on the server side,
which works with the 0.90.* versions. If you just read the table
content on the client side, I am afraid you have to first run a major
compaction and then re-insert your deleted data.
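
For example, from the hbase shell (the table name is a placeholder):

major_compact 'your_table'
# wait for the compaction to finish, then re-issue the Put with the old timestamp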

Regards!

Yong

On Wed, Apr 25, 2012 at 8:14 AM, lars hofhansl  wrote:
> Your only chance is to run a major compaction on your table - that will get 
> rid of the delete marker. Then you can re-add the Put with the same TS.
>
> -- Lars
>
> ps. Rereading my email below... At some point I will learn to proof-read my 
> emails before I send them full of grammatical errors.
>
>
> - Original Message -
> From: Mahdi Negahi 
> To: Hbase 
> Cc:
> Sent: Tuesday, April 24, 2012 10:46 PM
> Subject: RE: Problem to Insert the row that i was deleted
>
>
>
> thanks for ur sharing
>
> so there is no solution for return back the row ( or cells/columns) ?
>
>
>> Date: Tue, 24 Apr 2012 22:39:49 -0700
>> From: lhofha...@yahoo.com
>> Subject: Re: Problem to Insert the row that i was deleted
>> To: user@hbase.apache.org
>>
>> Rows (or rather cells/columns) are not actually deleted. Instead they are 
>> marked for deletion by a delete marker. The deleted cells are collected 
>> during the next major or minor comaction.
>>
>> As long as the marker exist new Put (with thje same timestamp as the 
>> existing Put will affected by the delete marker.
>> The delete marker itself will exist until the next major compaction.
>>
>> This might seems strange, but is actually an important feature of HBase as 
>> it allows operations to be executed in any order with the same end result.
>>
>> -- Lars
>>
>> 
>> From: Mahdi Negahi 
>> To: Hbase 
>> Sent: Tuesday, April 24, 2012 9:05 PM
>> Subject: Problem to Insert the row that i was deleted
>>
>>
>>
>>
>>
>> I delete a row and I want to add the same row ( with same Timestamp ) to 
>> HBase but it is not added to the table. I know if I changed the timestamp it 
>> will added but it is necessary to add it with the same timestamp.
>>
>> please advice me where is my problem ?
>>
>> regard
>> mahdi


Re: HBase 0.92 with Hadoop 0.22

2012-04-16 Thread yonghu
Yes. You can compile the Hadoop jar files yourself and put them into
the HBase lib folder.
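
Roughly like this (only a sketch; the exact jar names and locations
depend on how your 0.22 build is laid out):

# remove the bundled Hadoop jar and drop in the 0.22 common and hdfs jars
rm $HBASE_HOME/lib/hadoop-core-*.jar
cp $HADOOP_HOME/hadoop-common-0.22*.jar $HBASE_HOME/lib/
cp $HADOOP_HOME/hadoop-hdfs-0.22*.jar   $HBASE_HOME/lib/
# then restart HBase on every node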

Regards!

Yong

On Mon, Apr 16, 2012 at 2:09 PM, Harsh J  wrote:
> While I haven't tried this personally, it should be alright to do. You
> need to replace HBase's default hadoop jars (which are 1.0.x/0.20
> versioned) with those (of common and hdfs) from your 0.22
> installation.
>
> Apache Bigtop too has a branch for hadoop-0.22 that helps you build a
> whole 0.22-based, tested and packaged stack for yourself:
> https://svn.apache.org/repos/asf/incubator/bigtop/branches/hadoop-0.22/
>
> On Mon, Apr 16, 2012 at 5:30 PM, Konrad Tendera  wrote:
>> I'm wondering if there is any possibility to run HBase 0.92 on top of Hadoop 
>> 0.22? I can't find necessary jars such as hadoop-core...
>>
>> --
>> Konrad Tendera
>
>
>
> --
> Harsh J


Re: Is it possible to install two different Hbase versions in the same Cluster?

2012-04-16 Thread yonghu
Mike,

Can you explain why I can't put the RS on the same node?

Thanks!

Yong

On Mon, Apr 16, 2012 at 1:33 PM, Michel Segel  wrote:
> Sure, just make sure you don't cross the configurations and don't put the RS 
> on the same nodes.
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Apr 16, 2012, at 6:31 AM, yonghu  wrote:
>
>> Hello,
>>
>> I wonder if it's possible to install two different Hbase versions in
>> the same cluster?
>>
>> Thanks
>>
>> Yong
>>


Re: dump HLog content!

2012-04-15 Thread yonghu
Thanks for your important information. I have found this information
in the hbase-default.xml file.

Regards!

Yong

On Sun, Apr 15, 2012 at 8:36 PM, Manish Bhoge
 wrote:
> Yong,
>
> It is a Hlog log roll property that keep the log size 0 until the complete 
> block is written OR until it completes the log roll duration mentioned in 
> configuration (default 60 min). However it still persists the edits in .edit 
> files and once it reaches to the interval defined for log roll it writes back 
> to log. That is the reason you can see the logs(size) more than zero 
> byte.eventually it moves the log into .oldlogs also.
>
> Thanks
> Manish
> Sent from my BlackBerry, pls excuse typo
>
> -Original Message-
> From: yonghu 
> Date: Sun, 15 Apr 2012 18:58:45
> To: 
> Reply-To: user@hbase.apache.org
> Subject: Re: dump HLog content!
>
> Thanks for your reply. After nearly 60minutes, I can see the Hlog volume.
>
> -rw-r--r--   3 yonghu supergroup       2125 2012-04-15 17:34
> /hbase/.logs/yonghu-laptop,60020,1334504008467/yonghu-laptop%2C60020%2C1334504008467.1334504048854
>
> I have no idea why it takes so long time.
>
> Yong
>
> On Sun, Apr 15, 2012 at 6:34 PM, yonghu  wrote:
>> yes
>>
>> On Sun, Apr 15, 2012 at 6:30 PM, Ted Yu  wrote:
>>> Did 'HLog --dump' show real contents for a 0-sized file ?
>>>
>>> Cheers
>>>
>>> On Sun, Apr 15, 2012 at 8:58 AM, yonghu  wrote:
>>>
>>>> Hello,
>>>>
>>>> My hbase version is 0.92.0 and is installed in pseudo-mode. I found a
>>>> strange situation of HLog. After I inserted new data value into table,
>>>> the volume of HLog is 0. I checked in HDFS.
>>>>
>>>> drwxr-xr-x   - yonghu supergroup          0 2012-04-15 17:34 /hbase/.logs
>>>> drwxr-xr-x   - yonghu supergroup          0 2012-04-15 17:34
>>>> /hbase/.logs/yonghu-laptop,60020,1334504008467
>>>> -rw-r--r--   3 yonghu supergroup          0 2012-04-15 17:34
>>>>
>>>> /hbase/.logs/yonghu-laptop,60020,1334504008467/yonghu-laptop%2C60020%2C1334504008467.1334504048854
>>>>
>>>> But I can use hbase org.apache.hadoop.hbase.regionserver.wal.HLog
>>>> --dump to see the content of log information. However, if I write java
>>>> program to extract the log information. The output is null! Somebody
>>>> knows why?
>>>>
>>>> Thanks!
>>>>
>>>> Yong
>>>>


Re: dump HLog content!

2012-04-15 Thread yonghu
Thanks for your reply. After nearly 60 minutes, I can see the HLog file size.

-rw-r--r--   3 yonghu supergroup   2125 2012-04-15 17:34
/hbase/.logs/yonghu-laptop,60020,1334504008467/yonghu-laptop%2C60020%2C1334504008467.1334504048854

I have no idea why it takes such a long time.

Yong

On Sun, Apr 15, 2012 at 6:34 PM, yonghu  wrote:
> yes
>
> On Sun, Apr 15, 2012 at 6:30 PM, Ted Yu  wrote:
>> Did 'HLog --dump' show real contents for a 0-sized file ?
>>
>> Cheers
>>
>> On Sun, Apr 15, 2012 at 8:58 AM, yonghu  wrote:
>>
>>> Hello,
>>>
>>> My HBase version is 0.92.0, installed in pseudo-distributed mode. I found a
>>> strange situation with the HLog. After I inserted new data values into a table,
>>> the size of the HLog file was 0. I checked in HDFS.
>>>
>>> drwxr-xr-x   - yonghu supergroup          0 2012-04-15 17:34 /hbase/.logs
>>> drwxr-xr-x   - yonghu supergroup          0 2012-04-15 17:34
>>> /hbase/.logs/yonghu-laptop,60020,1334504008467
>>> -rw-r--r--   3 yonghu supergroup          0 2012-04-15 17:34
>>>
>>> /hbase/.logs/yonghu-laptop,60020,1334504008467/yonghu-laptop%2C60020%2C1334504008467.1334504048854
>>>
>>> But I can use hbase org.apache.hadoop.hbase.regionserver.wal.HLog
>>> --dump to see the log content. However, if I write a Java
>>> program to extract the log information, the output is null! Does somebody
>>> know why?
>>>
>>> Thanks!
>>>
>>> Yong
>>>


Re: dump HLog content!

2012-04-15 Thread yonghu
yes

On Sun, Apr 15, 2012 at 6:30 PM, Ted Yu  wrote:
> Did 'HLog --dump' show real contents for a 0-sized file ?
>
> Cheers
>
> On Sun, Apr 15, 2012 at 8:58 AM, yonghu  wrote:
>
>> Hello,
>>
>> My HBase version is 0.92.0, installed in pseudo-distributed mode. I found a
>> strange situation with the HLog. After I inserted new data values into a table,
>> the size of the HLog file was 0. I checked in HDFS.
>>
>> drwxr-xr-x   - yonghu supergroup          0 2012-04-15 17:34 /hbase/.logs
>> drwxr-xr-x   - yonghu supergroup          0 2012-04-15 17:34
>> /hbase/.logs/yonghu-laptop,60020,1334504008467
>> -rw-r--r--   3 yonghu supergroup          0 2012-04-15 17:34
>>
>> /hbase/.logs/yonghu-laptop,60020,1334504008467/yonghu-laptop%2C60020%2C1334504008467.1334504048854
>>
>> But I can use hbase org.apache.hadoop.hbase.regionserver.wal.HLog
>> --dump to see the log content. However, if I write a Java
>> program to extract the log information, the output is null! Does somebody
>> know why?
>>
>> Thanks!
>>
>> Yong
>>


dump HLog content!

2012-04-15 Thread yonghu
Hello,

My HBase version is 0.92.0, installed in pseudo-distributed mode. I found a
strange situation with the HLog. After I inserted new data values into a table,
the size of the HLog file was 0. I checked in HDFS.

drwxr-xr-x   - yonghu supergroup  0 2012-04-15 17:34 /hbase/.logs
drwxr-xr-x   - yonghu supergroup  0 2012-04-15 17:34
/hbase/.logs/yonghu-laptop,60020,1334504008467
-rw-r--r--   3 yonghu supergroup  0 2012-04-15 17:34
/hbase/.logs/yonghu-laptop,60020,1334504008467/yonghu-laptop%2C60020%2C1334504008467.1334504048854

But I can use hbase org.apache.hadoop.hbase.regionserver.wal.HLog
--dump to see the log content. However, if I write a Java
program to extract the log information, the output is null! Does somebody
know why?

Thanks!

Yong


Re: A confusion about the RegionCoprocessorEnvironment.getRegion() method

2012-04-10 Thread yonghu
Thanks for your explanation. Now it's clear to me.

Regards!

Yong

On Tue, Apr 10, 2012 at 6:13 PM, Gary Helmling  wrote:
> Each and every HRegion on a given region server will have its own
> distinct instance of your configured RegionObserver class.
> RegionCoprocessorEnvironment.getRegion() returns a reference to the
> HRegion containing the current coprocessor instance.
>
> The hierarchy is essentially:
>
> HRegionServer
> \_  HRegion
>        \_ RegionCoprocessorHost
>             \_  your configured RegionObserver instance(s)
>
> (repeated for each HRegion).
>
> This blog post by Mingjie may help explain things a bit more:
> https://blogs.apache.org/hbase/entry/coprocessor_introduction
>
>
> --gh
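
To make the per-region relationship concrete, here is a minimal, illustrative
observer (the class name is made up; this is a sketch against the 0.92-era
coprocessor API, not a tested implementation):

  import org.apache.hadoop.hbase.CoprocessorEnvironment;
  import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
  import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
  import org.apache.hadoop.hbase.regionserver.HRegion;

  public class RegionLoggingObserver extends BaseRegionObserver {

    @Override
    public void start(CoprocessorEnvironment e) {
      // Each HRegion hosts its own observer instance, so getRegion() returns
      // exactly the region this particular instance is attached to.
      if (e instanceof RegionCoprocessorEnvironment) {
        HRegion region = ((RegionCoprocessorEnvironment) e).getRegion();
        System.out.println("Observer started for region: "
            + region.getRegionNameAsString());
      }
    }
  }

Configured via hbase.coprocessor.region.classes in hbase-site.xml, one such instance
is created per HRegion, which is why getRegion() never needs to return a list.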
>
>
>
> On Tue, Apr 10, 2012 at 2:30 AM, yonghu  wrote:
>> Hello,
>>
>> The description of this method is " /** @return the region associated
>> with this coprocessor */" and the return value is an HRegion instance.
>> If I configure the region coprocessor class in hbase-site.xml, it
>> means that this coprocessor will be applied to every HRegion which
>> resides on this region server (if I understand right).  Why does this
>> method only return one HRegion instance and not a list of HRegion
>> instances? Suppose that a region server has two HRegions, one for
>> table 'test1' and the other for table 'test2'.  Which HRegion instance
>> will be returned if I call RegionCoprocessorEnvironment.getRegion()?
>>
>> Thanks!
>>
>> Yong


A confusion about the RegionCoprocessorEnvironment.getRegion() method

2012-04-10 Thread yonghu
Hello,

The description of this method is " /** @return the region associated
with this coprocessor */" and the return value is an HRegion instance.
If I configure the region coprocessor class in hbase-site.xml, it
means that this coprocessor will be applied to every HRegion which
resides on this region server (if I understand right).  Why does this
method only return one HRegion instance and not a list of HRegion
instances? Suppose that a region server has two HRegions, one for
table 'test1' and the other for table 'test2'.  Which HRegion instance
will be returned if I call RegionCoprocessorEnvironment.getRegion()?

Thanks!

Yong


Re: Still Seeing Old Data After a Delete

2012-03-27 Thread yonghu
Hi Shawn,

My HBase version is 0.92.0. I have to mention that I recently
noticed that the delete semantics of the shell and the Java API are
different. In the shell, if you delete one version, it masks the
versions whose timestamps are older than that version, meaning that
a scan will not return the values whose timestamps are older than that one.
But if you use the Java API, e.g. the delete.deleteColumn() method, it will
only delete that specific version. It will not affect the versions
whose timestamps are older than that one. I hope this is useful for you!
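
For clarity, here is a minimal sketch of the deleteColumn()/deleteColumns()
distinction that also comes up below (the table, family, and qualifier names are
made up; it targets the 0.90/0.92-era client API and is an illustration, not a
drop-in program):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class DeleteSemanticsExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");

      byte[] row  = Bytes.toBytes("row1");
      byte[] fam  = Bytes.toBytes("thing");
      byte[] qual = Bytes.toBytes("col1");

      // deleteColumn(): places a version delete marker, masking only the
      // latest (or an explicitly given) version of the cell.
      Delete latestOnly = new Delete(row);
      latestOnly.deleteColumn(fam, qual);
      table.delete(latestOnly);

      // deleteColumns(): places a column delete marker, masking all versions
      // of the cell up to the delete timestamp.
      Delete allVersions = new Delete(row);
      allVersions.deleteColumns(fam, qual);
      table.delete(allVersions);

      table.close();
    }
  }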

Regards!

Yong

On Tue, Mar 27, 2012 at 7:33 PM, Shawn Quinn  wrote:
> Hi Lars,
>
> Thanks for the quick reply!  In this case we were doing a column delete
> like so:
>
>            Delete delete = new Delete(rowKey);
>            delete.deleteColumn(Bytes.toBytes("thing"),
> Bytes.toBytes(value));
>            table.delete(delete);
>
> However, your response caused me to notice the "Delete.deleteColumns()"
> method in the JavaDoc instead of simply "Delete.deleteColumn()".  Calling
> the "deleteColumns" instead of "deleteColumn" fixes the problem we were
> seeing.  That wasn't immediately obvious to me after reading the book, but
> after reading the JavaDoc I now understand the distinction between the two
> methods.
>
> I may be the only one who missed that at first, but in case others have a
> similar confusion it might be worth a comment in the book that
> "deleteColumn()" is really only for deleting a single version and
> "deleteColumns()" is for deleting all versions.  E.g. the second type noted
> in the book currently is listed as "Delete column: for all versions of a
> column".  But, from the API perspective that's really the "deleteColumns()"
> method.  (Whereas, my incorrect intuition when just looking at the API was
> that the "deleteColumns()" method would likely be for deleting multiple
> different columns.)
>
> Thanks again for the quick follow up,
>
>     -Shawn
>
> On Tue, Mar 27, 2012 at 1:19 PM, lars hofhansl  wrote:
>
>> Hey Shawn,
>>
>> how exactly did you delete the column?
>> There are three types of delete markers: family, column, version.
>> Your observation would be consistent with having used a version delete
>> marker, which just marks a specific version (the latest by default) for
>> delete.
>>
>> Check out the HBase Reference Guide:
>> http://hbase.apache.org/book.html#version.delete
>>
>> Also, if you don't mind the plug see a more detailed discussion here:
>> http://hadoop-hbase.blogspot.com/2011/12/deletion-in-hbase.html
>>
>> -- Lars
>>
>>
>> - Original Message -
>> From: Shawn Quinn 
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Tuesday, March 27, 2012 10:01 AM
>> Subject: Still Seeing Old Data After a Delete
>>
>> Hello,
>>
>> In a couple of situations we were noticing some odd problems with old data
>> appearing in the application, and I finally found a reproducible scenario.
>> Here's what we're seeing in one basic case:
>>
>> 1. Using a scan in hbase shell one of our column cells (both the column
>> name and value are simple long's) looks like so:
>>
>> column=thing:\x00\x00\x00\x00\x00\x00\x00\x02, timestamp=1332795701976,
>> value=\x00\x00\x00\x00\x00\x00\x00s
>>
>> 2. If we then use a "Put" to update that cell to a new value it looks as
>> we'd expect like so:
>>
>> column=thing:\x00\x00\x00\x00\x00\x00\x00\x02, timestamp=1332866682295,
>> value=\x00\x00\x00\x00\x00\x00\x00u
>>
>> 3. If we then use a "Delete" to remove that column, instead of the column
>> no longer being included in the scan we instead see the following again:
>>
>> column=thing:\x00\x00\x00\x00\x00\x00\x00\x02, timestamp=1332795701976,
>> value=\x00\x00\x00\x00\x00\x00\x00s
>>
>> So, for some reason, at least in this case, the tombstone/delete marker
>> doesn't appear to be preventing new scans from seeing the old
>> data.
>>
>> Note that this is a small development cluster of HBase (version:
>> hbase-0.90.4-cdh3u2) which contains one master and three region servers,
>> and I have confirmed that the clocks are synchronized properly between the
>> four machines.  Also note that we're using the Java client API to run the
>> Put/Delete commands noted above.
>>
>> Any ideas on how old data could still appear in a Get/Scan like this, and
>> if there are any workarounds we could try?  I saw HBASE-4536, but after
>> reading that thread it didn't seem pertinent to this more basic scenario.
>>
>> Thanks in advance for any pointers!
>>
>>       -Shawn
>>
>>


Re: There is no data value information in HLog?

2012-03-20 Thread yonghu
Thanks for your response. That's really helpful.
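
For anyone else hitting this: assuming the --dump entry point forwards its options
to HLogPrettyPrinter (which the pointer below suggests), adding -p should print the
cell values as well, e.g. (replace the placeholder with a log file under /hbase/.logs):

  $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump -p <path-to-hlog-file>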

regards!

Yong

On Mon, Mar 19, 2012 at 5:32 PM, Ted Yu  wrote:
> Hi,
> Have you noticed this in HLogPrettyPrinter ?
>    options.addOption("p", "printvals", false, "Print values");
>
> Looks like you should have specified the above option.
>
> On Mon, Mar 19, 2012 at 7:31 AM, yonghu  wrote:
>
>> Hello,
>>
>> I used the $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog
>> --dump command to check the HLog information. But I cannot find any
>> data value information. The output of my HLog file looks as follows:
>>
>> Sequence 933 from region 85986149309dff24ecf7be4873136f15 in table test
>>  Action:
>>    row: Udo
>>    column: Course:Computer
>>    at time: Mon Mar 19 14:09:29 CET 2012
>>
>> Sequence 935 from region 85986149309dff24ecf7be4873136f15 in table test
>>  Action:
>>    row: Udo
>>    column: Course:Math
>>    at time: Mon Mar 19 14:09:29 CET 2012
>>
>> The functionality of the HLog is recovery. But without data value
>> information, how can HBase use the information in the HLog to do recovery?
>> My HBase version is 0.92.0.
>>
>> Regards!
>>
>> Yong
>>


There is no data value information in HLog?

2012-03-19 Thread yonghu
Hello,

I used the $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog
--dump command to check the HLog information. But I cannot find any
data value information. The output of my HLog file looks as follows:

Sequence 933 from region 85986149309dff24ecf7be4873136f15 in table test
  Action:
row: Udo
column: Course:Computer
at time: Mon Mar 19 14:09:29 CET 2012

Sequence 935 from region 85986149309dff24ecf7be4873136f15 in table test
  Action:
row: Udo
column: Course:Math
at time: Mon Mar 19 14:09:29 CET 2012

The functionality of the HLog is recovery. But without data value
information, how can HBase use the information in the HLog to do recovery?
My HBase version is 0.92.0.

Regards!

Yong

