Re: multiple data versions vs. multiple rows?

2015-01-20 Thread yonghu
I think we need to look at different situations.

1. One column gets updated frequently and the others do not. If we use the
row representation, we repeat the unchanged column values in every new row,
which may cause large data redundancy. I think this explains why, in my
test, the multiple-data-version approach performed better than the
multiple-row approach.

2. All columns get updated evenly. In that case there will not be much
difference in data volume between the two, since each data version is
actually stored as its own key-value pair. In this situation, the
performance difference between the two approaches should not be
significant.
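For concreteness, here is a minimal Java sketch of the two layouts being
compared (the table handle, the "cf" family, the "event" qualifier and the
payload are made-up names, and the 0.98-era HTable/Put API is assumed):

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HistoryLayouts {
  private static final byte[] CF = Bytes.toBytes("cf");
  private static final byte[] QUAL = Bytes.toBytes("event");

  // Layout 1: one row per user, history kept as cell versions
  // (the column family must be created with a large enough MAX_VERSIONS).
  static void writeAsVersion(HTable table, String userId, long eventTs,
                             byte[] payload) throws Exception {
    Put put = new Put(Bytes.toBytes(userId));
    put.add(CF, QUAL, eventTs, payload);   // explicit timestamp = event time
    table.put(put);
  }

  // Layout 2: one row per event, event time appended to the row key.
  static void writeAsRow(HTable table, String userId, long eventTs,
                         byte[] payload) throws Exception {
    byte[] rowKey = Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(eventTs));
    Put put = new Put(rowKey);
    put.add(CF, QUAL, payload);
    table.put(put);
  }
}

Note that in both layouts every cell is persisted as its own key-value, which
is why the data volume ends up similar when all columns change evenly.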

Yong

On Tue, Jan 20, 2015 at 8:16 AM, Serega Sheypak serega.shey...@gmail.com
wrote:

 Should performance differ significantly if the row value size is small and
 we don't have too many versions?
 Assume that the pack of versions for a key is smaller than the recommended
 HFile block size (8KB to 1MB,
 https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/HFile.html),
 which is the minimal read unit. Should we see any difference at all?
 Am I right?
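As a point of reference, the per-family settings involved could be declared
like this when creating the table (a sketch only; the "t1"/"cf" names, the 10
versions, and the 64 KB block size are illustrative values, not
recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateVersionedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      HColumnDescriptor cf = new HColumnDescriptor("cf");
      cf.setMaxVersions(10);       // keep up to 10 versions per cell
      cf.setBlocksize(64 * 1024);  // HFile block size, the minimal read unit

      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t1"));
      desc.addFamily(cf);
      admin.createTable(desc);
    } finally {
      admin.close();
    }
  }
}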


 2015-01-20 0:33 GMT+03:00 Jean-Marc Spaggiari jean-m...@spaggiari.org:

  Hi Yong,
 
  If you want to compare the performance, you need to run much bigger and
  longer tests. Don't run them in parallel. Run them at least 10 times each
  to make sure you have a good trend. Is the difference between the 2
  significant? It should not be.
 
  JM
 
  2015-01-19 15:17 GMT-05:00 yonghu yongyong...@gmail.com:
 
   Hi,
  
   Thanks for your suggestion. I have already considered the first issue,
   that a row is not allowed to be split between 2 regions.
  
   However, I made a small scan test with MapReduce. I first created a table
   t1 with 1 million rows and allowed each column to store 10 data versions.
   Then I translated t1 into t2, in which the multiple data versions of t1
   were transformed into multiple rows in t2. I wrote two MapReduce programs
   to scan t1 and t2 individually. What I found is that the table scan time
   for t1 is shorter than for t2. So I think, for performance reasons,
   multiple data versions may be a better option than multiple rows.
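For anyone who wants to reproduce the comparison, a rough sketch of the
scan-job setup for t1 (class names are placeholders and the 0.98
TableMapReduceUtil API is assumed; note the scan must explicitly request all
versions, otherwise only the newest one is read):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanT1 {
  // Mapper that only counts cells; it emits nothing, so no reducer is needed.
  static class CountMapper
      extends TableMapper<ImmutableBytesWritable, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
      ctx.getCounter("scan", "cells").increment(value.rawCells().length);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-t1");
    job.setJarByClass(ScanT1.class);

    Scan scan = new Scan();
    scan.setMaxVersions(10);     // read all 10 stored versions, not just the latest
    scan.setCaching(500);        // rows fetched per RPC
    scan.setCacheBlocks(false);  // don't pollute the block cache during a full scan

    TableMapReduceUtil.initTableMapperJob("t1", scan, CountMapper.class,
        ImmutableBytesWritable.class, LongWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}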
  
   But just as you said, which approach to use depends on how many
  historical
   events you want to keep.
  
   regards!
  
   Yong
  
  
   On Mon, Jan 19, 2015 at 8:37 PM, Jean-Marc Spaggiari 
   jean-m...@spaggiari.org wrote:
  
Hi Yong,
   
A row will not be split between 2 regions. If you plan on having thousands
of versions, then depending on the size of your data you might end up with a
row bigger than your preferred region size.
   
If you plan to keep just a few versions of the history to have a look at
it, I would say go with versions. If you plan to have a million versions
because you want to keep the whole event history, go with the row approach.
   
You can also consider going with the column qualifier approach. This has
the same constraint as versions regarding the split across 2 regions,
but it might be easier to manage and still gives you the consistency of
being within a single row.
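A minimal sketch of that qualifier-per-event layout, under the assumption
that the event timestamp is used as the column qualifier (the "h" family and
the helper names are made up):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class QualifierPerEvent {
  private static final byte[] CF = Bytes.toBytes("h");

  // One row per user; each event becomes its own column, keyed by event time.
  static void addEvent(HTable table, String userId, long eventTs,
                       byte[] payload) throws IOException {
    Put put = new Put(Bytes.toBytes(userId));
    put.add(CF, Bytes.toBytes(eventTs), payload);
    table.put(put);
  }

  // Reading the whole history is a single Get and is consistent within the row.
  static Result readHistory(HTable table, String userId) throws IOException {
    return table.get(new Get(Bytes.toBytes(userId)).addFamily(CF));
  }
}

Like the versions approach, the whole history lives in one row, so the row
can still outgrow a region if the history is unbounded.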
   
JM
   
2015-01-19 14:28 GMT-05:00 yonghu yongyong...@gmail.com:
   
 Dear all,

 I want to record user history data. I know there are two options: one is
 to store user events in a single row with multiple data versions, and the
 other is to use multiple rows. I wonder which one is better for
 performance?

 Thanks!

 Yong

   
  
 



Low-latency queries, HBase exclusively or should I go, e.g.: MongoDB?

2015-01-20 Thread Alec Taylor
I am architecting a platform incorporating: recommender systems,
information retrieval (ML), sequence mining, and Natural Language
Processing.

Additionally I have the generic CRUD and authentication components,
with everything exposed RESTfully.

For the storage layer(s), there are a few options which immediately
present themselves:

Generic CRUD layer (high speed needed here, though I suppose I could use Redis…)

- Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
SQL layer atop
- Apache Spark (perhaps piping to HDFS)… ¿maybe?
- MongoDB (or a similar document-store), a graph-database, or even
something like Postgres

Analytics layer (to enable Big Data / Data-intensive computing features)

- Apache Spark
- Hadoop with MapReduce and/or utilising some other Apache /
non-Apache project with integration
- Disco (from Nokia)



Should I prefer one layer (e.g. on HDFS) over multiple disparate
layers? The advantage here is obvious, but I am certain there are
disadvantages. (And yes, I know there are various ways, automated and
manual, to push data from non-HDFS-backed stores to HDFS.)

Also, as a bonus answer, which stack would you recommend for this
user-network I'm building?


Re: IllegalArgumentException: Connection is null or closed when calling HConnection.getTable()

2015-01-20 Thread Calvin Lei
Hi Nick,
 the HConnection is unmanaged and we cache it for the lifetime of the
application until it shuts down. I am not calling HConnection.close
anywhere in my code except for the shutdown hook.

On Mon, Jan 19, 2015 at 7:39 PM, Nick Dimiduk ndimi...@gmail.com wrote:

 Hi Calvin,

 An HConnection created via
 HConnectionManager#createConnection(Configuration) is an unmanaged
 connection, meaning its lifecycle is managed by your code. Are you calling
 HConnection#close() on that instance someplace?

 Please note that these are different semantics from the previous
 HConnectionManager#getConnection(Configuration), which returns a managed
 connection, one whose lifecycle is managed by the HBase client.

 -n
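For the archives, a minimal sketch of the unmanaged-connection lifecycle
described above, against the 0.98 client API (the table name and error
handling are illustrative only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;

public class ConnectionLifecycle {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Unmanaged connection: our code owns its lifecycle.
    final HConnection connection = HConnectionManager.createConnection(conf);

    // Close the shared connection only once, at application shutdown.
    Runtime.getRuntime().addShutdownHook(new Thread() {
      @Override public void run() {
        try { connection.close(); } catch (Exception ignored) { }
      }
    });

    // Tables are lightweight; get one per unit of work and close it,
    // but do NOT close the connection here.
    HTableInterface table = connection.getTable("t1");
    try {
      // ... use the table ...
    } finally {
      table.close();
    }
  }
}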

 On Mon, Jan 19, 2015 at 4:29 PM, Calvin Lei ckp...@gmail.com wrote:

  Thanks. I was more curious why the connection would be closed.
 
  On Mon, Jan 19, 2015 at 5:22 PM, Ted Yu yuzhih...@gmail.com wrote:
 
    Here is related code from the HTable ctor:

    if (connection == null || connection.isClosed()) {
      throw new IllegalArgumentException("Connection is null or closed.");
    }

    It was likely that the connection was closed (from your description of
    your code).
  
    If HConnectionImplementation were to check the status of the connection
    before calling the HTable ctor, that would be helpful.
    For the moment, your application should check the status of the
    connection (see the sketch below).
  
   Cheers
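A small sketch of that defensive check (recreating the connection is just one
possible reaction, shown here for illustration; the field and method names are
made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;

public class ConnectionHolder {
  private final Configuration conf;
  private HConnection connection;

  ConnectionHolder(Configuration conf) { this.conf = conf; }

  // Check the connection status before handing out tables; reopen if needed.
  synchronized HConnection getLiveConnection() throws IOException {
    if (connection == null || connection.isClosed()) {
      connection = HConnectionManager.createConnection(conf);
    }
    return connection;
  }
}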
  
   On Mon, Jan 19, 2015 at 1:48 PM, Calvin Lei ckp...@gmail.com wrote:
  
I upgraded to 0.98.0.2.1.1.0-385-hadoop2. The exception from hbase is:

java.lang.IllegalArgumentException: Connection is null or closed.
  at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:302)
  at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:763)
  at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:745)
  at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:740)
   
   
On Mon, Jan 19, 2015 at 4:37 PM, Ted Yu yuzhih...@gmail.com wrote:
   
 Which 0.98 release did you upgrade to?

 Can you pastebin the whole stack trace?

 Thanks



  On Jan 19, 2015, at 1:25 PM, Calvin Lei ckp...@gmail.com
 wrote:
 
  Dear all,
I recently upgraded to HBase 0.98 and have started seeing the error
"Connection is null or closed" when calling HConnection.getTable().
As recommended by the documentation, I create an HConnection using
HConnectionManager.createConnection(config) at app start and close the
connection at app shutdown. It looks like the state of the cluster changed
during the lifetime of the app and the HConnection closed out all
connections.
Do I have to check isClosed() before I call getTable()?

   
  
 



[HBase Restful] How to get a composite RowKey: StringInteger

2015-01-20 Thread anil gupta
Hi,

I want to fetch a row through the HBase REST client. The composite rowkey is
a String followed by an Integer.
This is how the rowkey looks in the hbase shell: ABCDEF\x00\x00\x00\x00\x01
I tried this request: hostname:8080/TABLE/ABCDEF%2Fx00%2Fx00%2Fx00%2Fx00%2Fx01
Is it possible to get this kind of row from the RESTful interface? If yes,
can anyone tell me how to get this row via REST?

Thanks & Regards,
Anil Gupta
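One approach worth trying (an assumption on my part, based on the REST
gateway generally accepting percent-encoded raw key bytes rather than the
shell's \xNN notation) is to percent-encode each byte of the binary key, so
the key above becomes ABCDEF%00%00%00%00%01. A rough Java sketch, with host,
port and table name as placeholders:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.hbase.util.Bytes;

public class RestGetCompositeKey {
  // Percent-encode non-alphanumeric bytes, e.g. 0x00 -> %00, 0x01 -> %01.
  static String encodeRowKey(byte[] key) {
    StringBuilder sb = new StringBuilder();
    for (byte b : key) {
      char c = (char) (b & 0xFF);
      if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')) {
        sb.append(c);                              // safe to send as-is
      } else {
        sb.append(String.format("%%%02X", b & 0xFF));
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    // Build the same bytes the shell displays as ABCDEF\x00\x00\x00\x00\x01.
    byte[] rowKey = Bytes.add(Bytes.toBytes("ABCDEF"),
                              new byte[] { 0x00, 0x00, 0x00, 0x00, 0x01 });
    String url = "http://hostname:8080/TABLE/" + encodeRowKey(rowKey);

    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestProperty("Accept", "application/json");
    int code = conn.getResponseCode();
    System.out.println("HTTP " + code);
    if (code == 200) {
      try (InputStream in = conn.getInputStream()) {
        // ... parse the JSON cell representation here ...
      }
    }
  }
}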


Re: Given a Put object, is there any way to change the timestamp of it?

2015-01-20 Thread Ted Yu
bq. I have no way to get ALL its cells out
Mutation has the following method:

  public NavigableMap<byte[], List<Cell>> getFamilyCellMap() {

FYI
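To flesh that out: one workaround (a sketch against the 0.98 client API, not
an official recipe) is to walk all cells of the original Put via
getFamilyCellMap() and re-add them to a new Put built with the desired
timestamp, without needing to know the family or qualifier names in advance:

import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;

public class RetimestampPut {
  // Copy every cell of 'original' into a new Put carrying 'newTs'.
  static Put withTimestamp(Put original, long newTs) {
    Put copy = new Put(original.getRow(), newTs);
    NavigableMap<byte[], List<Cell>> familyMap = original.getFamilyCellMap();
    for (Map.Entry<byte[], List<Cell>> entry : familyMap.entrySet()) {
      for (Cell cell : entry.getValue()) {
        copy.add(CellUtil.cloneFamily(cell),
                 CellUtil.cloneQualifier(cell),
                 newTs,
                 CellUtil.cloneValue(cell));
      }
    }
    return copy;
  }
}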

On Tue, Jan 20, 2015 at 5:43 PM, Liu, Ming (HPIT-GADSC) ming.l...@hp.com
wrote:

 Hello, there,

 I am developing a coprocessor under HBase 0.98.6. The client sends a Put
 object to the coprocessor in Protobuf; when the coprocessor receives the
 message, it invokes ProtobufUtil.toPut to convert it to a Put object, does
 various checks, and then puts it into the HBase table.
 Now I have a requirement to change the timestamp of that Put object, but I
 found no way to do this.

 I first tried to generate a new Put object with a new timestamp and to copy
 the old one into this new object. But I found that, given a Put object, I
 have no way to get ALL its cells out if I don't know the column family and
 column qualifier names in advance. In my case, those CF/column names are
 random, as they are user defined. So I am stuck here. Does anyone have an
 idea how to work around this?

 The Mutation class has a getTimestamp() method but no setTimestamp(). I
 wish there were a setTimestamp() for it. Is there any reason it is not
 provided? I hope a future release can expose a setTimestamp() method on
 Mutation; is that possible? If so, my job will get much easier...

 Thanks,
 Ming



Given a Put object, is there any way to change the timestamp of it?

2015-01-20 Thread Liu, Ming (HPIT-GADSC)
Hello, there,

I am developing a coprocessor under HBase 0.98.6. The client sends a Put object
to the coprocessor in Protobuf; when the coprocessor receives the message, it
invokes ProtobufUtil.toPut to convert it to a Put object, does various checks,
and then puts it into the HBase table.
Now I have a requirement to change the timestamp of that Put object, but I
found no way to do this.

I first tried to generate a new Put object with a new timestamp and to copy
the old one into this new object. But I found that, given a Put object, I have
no way to get ALL its cells out if I don't know the column family and column
qualifier names in advance. In my case, those CF/column names are random, as
they are user defined. So I am stuck here. Does anyone have an idea how to
work around this?

The Mutation class has a getTimestamp() method but no setTimestamp(). I wish
there were a setTimestamp() for it. Is there any reason it is not provided? I
hope a future release can expose a setTimestamp() method on Mutation; is that
possible? If so, my job will get much easier...

Thanks,
Ming


Re: Does 'online region merge' make regions unavailable for some time?

2015-01-20 Thread Otis Gospodnetic
Hi,

Considering this is called the *online* region merge, I would assume the
regions being merged never go offline during the merge and that both remain
available for reading and writing at all times. That said, I don't see how
writes would work if one region is being moved from one RS to another, so
maybe this is not truly online and writes are either rejected or
buffered/blocked until the region is moved AND merged? Does anyone know for
sure?

I see this in one of the comments:
Q: If one (or both) of the regions were receiving non-trivial load prior to
this action, would client(s) be affected?
A: Yes, the region would be out of service for a short time; it is equivalent
to moving a region, e.g. balancing a region.

Also took a look at the patch:
https://issues.apache.org/jira/secure/attachment/12574965/hbase-7403-trunkv33.patch

And see:

+/**
+ * The merging region A has been taken out of the server's online regions list.
+ */
+OFFLINED_REGION_A,


... and if you look for the word "offline" in the patch, I think it's
pretty clear that BOTH regions being merged do go offline at some point. I
guess it could be after the merge, too, not before...

... maybe others know?


Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
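For anyone who wants to experiment, the merge itself can be triggered from the
0.98 client roughly like this (a sketch; the encoded region names are
placeholders you would copy from the web UI or the shell, where the equivalent
one-liner is the merge_region command):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class MergeTwoRegions {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Encoded region names, e.g. "a3f1c0..." and "b7e2d9..." (placeholders).
      byte[] regionA = Bytes.toBytes("ENCODED_REGION_NAME_A");
      byte[] regionB = Bytes.toBytes("ENCODED_REGION_NAME_B");
      // forcible = false: only merge adjacent regions.
      admin.mergeRegions(regionA, regionB, false);
    } finally {
      admin.close();
    }
  }
}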


On Mon, Jan 19, 2015 at 4:17 AM, Vladimir Tretyakov 
vladimir.tretya...@sematext.com wrote:

 Hi, I have one question about 'online region merge' (
 https://issues.apache.org/jira/browse/HBASE-7403).
 As I understand it, the regions passed to the merge method will be
 unavailable for some time.

 That means:
 1. Some data will be unavailable for some time.
 2. If a client tries to write data to these regions, it will get exceptions.

 Are the above statements correct?

 Can somebody estimate for how long 1 and 2 will be true? Seconds, minutes,
 or hours? Is there any way to avoid 1 and 2?

 I am asking because we currently have a problem over time with the number of
 regions (our key contains a timestamp): the count of regions grows constantly
 (through splitting), and over time this becomes a cause of performance
 problems.
 To avoid this effect we use 2 tables:
 1. The first table we use for writing and reading data.
 2. The second we use only for reading data.

 After some time we truncate the second table and rotate the tables (the first
 becomes the second and the second becomes the first). That allows us to
 control the count of regions, but the solution looks a bit ugly. I looked at
 'online region merge', but we can't live with the restrictions I've described
 in the first part of the question.

 Can somebody help with answers?

 Thx, Vladimir Tretyakov.