Re: multiple data versions vs. multiple rows?
I think we need to look at different situations:

1. One column is updated frequently while the others are not. With the multiple-row representation, every new tuple repeats the unchanged column values, which can cause significant data redundancy. I think this explains why, in my test, the multiple-data-versions approach was better than the multiple-rows approach.

2. All columns are updated evenly. In that case there is not much difference in data volume between the two, since each data version is stored as a key-value pair anyway, so the performance difference between the two approaches will not be significant.

Yong

On Tue, Jan 20, 2015 at 8:16 AM, Serega Sheypak serega.shey...@gmail.com wrote:

Should performance differ significantly if the row value size is small and we don't have too many versions? Assume that a pack of versions for a key is smaller than the recommended HFile block size (8KB to 1MB, https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/HFile.html), which is the minimal read unit. Should we see any difference at all? Am I right?

2015-01-20 0:33 GMT+03:00 Jean-Marc Spaggiari jean-m...@spaggiari.org:

Hi Yong,

If you want to compare the performance, you need to run much bigger and longer tests. Don't run them in parallel. Run each of them at least 10 times to make sure you have a good trend. Is the difference between the two significant? It should not be.

JM

2015-01-19 15:17 GMT-05:00 yonghu yongyong...@gmail.com:

Hi,

Thanks for your suggestion. I had already considered the first issue, that one row is not allowed to be split between two regions. However, I made a small scan test with MapReduce. I first created a table t1 with 1 million rows and allowed each column to store 10 data versions. Then I translated t1 into t2, in which the multiple data versions of t1 were transformed into multiple rows. I wrote two MapReduce programs to scan t1 and t2 individually. What I found is that the scan time for t1 is shorter than for t2. So I think, for performance reasons, multiple data versions may be a better option than multiple rows. But just as you said, which approach to use depends on how much event history you want to keep.

Regards!
Yong

On Mon, Jan 19, 2015 at 8:37 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:

Hi Yong,

A row will not be split between two regions. If you plan on having thousands of versions, then depending on the size of your data you might end up with a row bigger than your preferred region size. If you just plan to keep a few versions of the history to have a look at it, I would say go with versions. If you plan to have a million versions because you want to keep the entire event history, go with the row approach.

You can also consider the column-qualifier approach. This has the same constraint as versions regarding the split between two regions, but it might be easier to manage and still gives you the consistency of staying within a single row.

JM

2015-01-19 14:28 GMT-05:00 yonghu yongyong...@gmail.com:

Dear all,

I want to record user history data. I know there are two options: one is to store user events in a single row with multiple data versions, and the other is to use multiple rows. I wonder which one is better for performance?

Thanks!
Yong
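[Editor's note: to make the two layouts discussed in this thread concrete, here is a minimal sketch against the 0.98-era Java client API used elsewhere in these threads. The table names (history_versions, history_rows), the column family e, and the qualifier event are made up for illustration.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HistoryLayouts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    byte[] cf = Bytes.toBytes("e");
    byte[] qual = Bytes.toBytes("event");
    byte[] user = Bytes.toBytes("user42");
    long ts = System.currentTimeMillis();

    // Layout 1: one row per user, history kept as cell versions.
    // The column family must have been created with enough versions,
    // e.g. VERSIONS => 10 as in Yong's test, or old versions are
    // compacted away.
    HTable versioned = new HTable(conf, "history_versions");
    Put p1 = new Put(user);
    p1.add(cf, qual, ts, Bytes.toBytes("logged_in")); // explicit timestamp acts as the version
    versioned.put(p1);
    versioned.close();

    // Layout 2: one row per event, with the timestamp folded into the
    // row key so a user's whole history is a contiguous row range.
    HTable rows = new HTable(conf, "history_rows");
    Put p2 = new Put(Bytes.add(user, Bytes.toBytes(ts)));
    p2.add(cf, qual, Bytes.toBytes("logged_in"));
    rows.put(p2);
    rows.close();
  }
}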
Low-latency queries: HBase exclusively, or should I go with, e.g., MongoDB?
I am architecting a platform incorporating recommender systems, information retrieval (ML), sequence mining, and Natural Language Processing. Additionally I have the generic CRUD and authentication components, with everything exposed RESTfully. For the storage layer(s), there are a few options which immediately present themselves:

Generic CRUD layer (high speed needed here, though I suppose I could use Redis…)
- Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema SQL layer atop
- Apache Spark (perhaps piping to HDFS)… ¿maybe?
- MongoDB (or a similar document store), a graph database, or even something like Postgres

Analytics layer (to enable Big Data / data-intensive computing features)
- Apache Spark
- Hadoop with MapReduce and/or some other Apache / non-Apache project with integration
- Disco (from Nokia)

Should I prefer one layer (e.g. everything on HDFS) over multiple disparate layers?
- The advantage here is obvious, but I am certain there are disadvantages. (And yes, I know there are various ways, automated and manual, to push data from non-HDFS-backed stores to HDFS.)

Also, as a bonus answer, which stack would you recommend for this user network I'm building?
Re: IllegalArgumentException: Connection is null or closed when calling HConnection.getTable()
Hi Nick,

The HConnection is unmanaged and we cache it for the lifetime of the application until it shuts down. I am not calling HConnection.close() anywhere in my code except for the shutdown hook.

On Mon, Jan 19, 2015 at 7:39 PM, Nick Dimiduk ndimi...@gmail.com wrote:

Hi Calvin,

An HConnection created via HConnectionManager#createConnection(Configuration) is an unmanaged connection, meaning its lifecycle is managed by your code. Are you calling HConnection#close() on that instance someplace? Please note that these are different semantics from the previous HConnectionManager#getConnection(Configuration), which returns a managed connection, one whose lifecycle is managed by the HBase client.

-n

On Mon, Jan 19, 2015 at 4:29 PM, Calvin Lei ckp...@gmail.com wrote:

Thanks. I was more curious why the connection would be closed.

On Mon, Jan 19, 2015 at 5:22 PM, Ted Yu yuzhih...@gmail.com wrote:

Here is the related code from the HTable ctor:

if (connection == null || connection.isClosed()) {
  throw new IllegalArgumentException("Connection is null or closed.");
}

It is likely that the connection was closed (from your description of your code). It would be helpful if HConnectionImplementation checked the status of the connection before calling the HTable ctor. For the moment, your application should check the status of the connection.

Cheers

On Mon, Jan 19, 2015 at 1:48 PM, Calvin Lei ckp...@gmail.com wrote:

I upgraded to 0.98.0.2.1.1.0-385-hadoop2. The exception from HBase is:

java.lang.IllegalArgumentException: Connection is null or closed.
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:302)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:763)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:745)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getTable(HConnectionManager.java:740)

On Mon, Jan 19, 2015 at 4:37 PM, Ted Yu yuzhih...@gmail.com wrote:

Which 0.98 release did you upgrade to? Can you pastebin the whole stack trace?

Thanks

On Jan 19, 2015, at 1:25 PM, Calvin Lei ckp...@gmail.com wrote:

Dear all,

I recently upgraded to HBase 0.98 and I have started seeing the error "Connection is null or closed" when calling HConnection.getTable(). As recommended by the documentation, I create an HConnection using HConnectionManager.createConnection(config) at app start and close the connection at app shutdown. It looks like the state of the cluster changed during the lifetime of the app and the HConnection closed out all its connections. Do I have to check isClosed() before I call getTable()?
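[Editor's note: as a reference for the unmanaged lifecycle Nick describes, and the defensive isClosed() check Ted suggests, here is a minimal sketch against the 0.98 client API. The ConnectionHolder class and its method names are hypothetical.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;

public class ConnectionHolder {
  private final Configuration conf = HBaseConfiguration.create();
  private volatile HConnection connection;

  /** Create the unmanaged connection once, at application start. */
  public synchronized void start() throws Exception {
    connection = HConnectionManager.createConnection(conf);
  }

  /**
   * Guard against the connection having been closed behind our back,
   * recreating it if necessary, before handing out a table. Table
   * instances are lightweight; close them after each use.
   */
  public HTableInterface getTable(String name) throws Exception {
    HConnection c = connection;
    if (c == null || c.isClosed()) {
      synchronized (this) {
        if (connection == null || connection.isClosed()) {
          connection = HConnectionManager.createConnection(conf);
        }
        c = connection;
      }
    }
    return c.getTable(TableName.valueOf(name));
  }

  /** Close the unmanaged connection exactly once, at shutdown. */
  public synchronized void stop() throws Exception {
    if (connection != null) {
      connection.close();
    }
  }
}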
[HBase Restful] How to get a composite RowKey: String + Integer
Hi,

I want to fetch a row through the HBase REST client. The composite rowkey is: String + Integer. This is how the rowkey looks in the hbase shell:

ABCDEF\x00\x00\x00\x00\x01

I tried this request:

hostname:8080/TABLE/ABCDEF%2Fx00%2Fx00%2Fx00%2Fx00%2Fx01

Is it possible to get this kind of row from the RESTful interface? If yes, can anyone tell me how to get this row via REST?

Thanks & Regards,
Anil Gupta
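[Editor's note: the shell's \x00 notation denotes a single zero byte, while %2Fx00 percent-decodes to the literal characters "/x00", so the request above asks for a different key than the one stored. Below is a sketch of one way to percent-encode each raw key byte as %XX for the REST URL; the helper class and method names are made up.]

import java.io.ByteArrayOutputStream;

public class RowKeyUrl {
  /**
   * Percent-encode a raw HBase row key for the REST gateway:
   * each non-alphanumeric byte becomes %XX.
   */
  static String encodeRowKey(byte[] key) {
    StringBuilder sb = new StringBuilder();
    for (byte b : key) {
      int v = b & 0xFF;
      // Keep unreserved ASCII characters readable; escape the rest.
      if ((v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
          || (v >= '0' && v <= '9') || v == '-' || v == '_' || v == '.') {
        sb.append((char) v);
      } else {
        sb.append(String.format("%%%02X", v));
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    // "ABCDEF" followed by the five raw bytes shown by the shell.
    ByteArrayOutputStream key = new ByteArrayOutputStream();
    key.write("ABCDEF".getBytes("US-ASCII"));
    key.write(new byte[] {0x00, 0x00, 0x00, 0x00, 0x01});
    // Prints ABCDEF%00%00%00%00%01
    System.out.println(encodeRowKey(key.toByteArray()));
  }
}

With the key from the question this yields GET hostname:8080/TABLE/ABCDEF%00%00%00%00%01.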
Re: Given a Put object, is there any way to change the timestamp of it?
bq. I have no way to get ALL its cells out

Mutation has the following method:

public NavigableMap<byte[], List<Cell>> getFamilyCellMap() {

FYI

On Tue, Jan 20, 2015 at 5:43 PM, Liu, Ming (HPIT-GADSC) ming.l...@hp.com wrote:

Hello there,

I am developing a coprocessor under HBase 0.98.6. The client sends a Put object to the coprocessor in protobuf; when the coprocessor receives the message, it invokes ProtobufUtil.toPut to convert it to a Put object, does various checks, and then puts it into an HBase table. Now I have a requirement to change the timestamp of that Put object, but I have found no way to do this. I first tried to generate a new Put object with a new timestamp and copy the old one into the new object. But I found that, given a Put object, I have no way to get ALL its cells out if I don't know the column family and column qualifier names in advance. In my case, those CF/column names are random, as they are user-defined. So I am stuck here. Does anyone have an idea how to work around this?

The Mutation class has a getTimestamp() method but no setTimestamp(). I wish there were a setTimestamp() for it. Is there any reason it is not provided? I hope a future release can expose a setTimestamp() method on Mutation; is that possible? If so, my job would get much easier...

Thanks,
Ming
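[Editor's note: building on the getFamilyCellMap() hint, here is a minimal sketch against the 0.98 client API of rebuilding a Put with a new timestamp without knowing any family or qualifier names in advance. The PutRewriter class and withTimestamp name are hypothetical.]

import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;

public class PutRewriter {
  /**
   * Rebuild a Put so every cell carries the given timestamp.
   * getFamilyCellMap() yields ALL cells, grouped by column family,
   * so no family/qualifier names need to be known up front.
   */
  public static Put withTimestamp(Put original, long newTs) {
    Put rewritten = new Put(original.getRow(), newTs);
    for (Map.Entry<byte[], List<Cell>> family : original.getFamilyCellMap().entrySet()) {
      for (Cell cell : family.getValue()) {
        rewritten.add(CellUtil.cloneFamily(cell),
                      CellUtil.cloneQualifier(cell),
                      newTs,
                      CellUtil.cloneValue(cell));
      }
    }
    return rewritten;
  }
}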
Given a Put object, is there any way to change the timestamp of it?
Hello there,

I am developing a coprocessor under HBase 0.98.6. The client sends a Put object to the coprocessor in protobuf; when the coprocessor receives the message, it invokes ProtobufUtil.toPut to convert it to a Put object, does various checks, and then puts it into an HBase table. Now I have a requirement to change the timestamp of that Put object, but I have found no way to do this. I first tried to generate a new Put object with a new timestamp and copy the old one into the new object. But I found that, given a Put object, I have no way to get ALL its cells out if I don't know the column family and column qualifier names in advance. In my case, those CF/column names are random, as they are user-defined. So I am stuck here. Does anyone have an idea how to work around this?

The Mutation class has a getTimestamp() method but no setTimestamp(). I wish there were a setTimestamp() for it. Is there any reason it is not provided? I hope a future release can expose a setTimestamp() method on Mutation; is that possible? If so, my job would get much easier...

Thanks,
Ming
Re: Does 'online region merge' make regions unavailable for some time?
Hi,

Considering this is called the *online* region merge, I would assume the regions being merged never go offline during the merge and that both remain available for reading and writing at all times, even during the merge. Though I don't see how writes would work if one region is being moved from one RS to another, so maybe this is not truly online, and writes are either rejected or buffered/blocked until the region is moved AND merged? Does anyone know for sure?

I see this in one of the comments:

Q: If one (or both) of the regions were receiving non-trivial load prior to this action, would client(s) be affected?
A: Yes, the region would be out of service for a short time; it is equivalent to moving a region, e.g. balancing a region.

I also took a look at the patch: https://issues.apache.org/jira/secure/attachment/12574965/hbase-7403-trunkv33.patch

And see:

+/**
+ * The merging region A has been taken out of the server's online regions list.
+ */
+OFFLINED_REGION_A,

... and if you look for the word "offline" in the patch, I think it's pretty clear that BOTH regions being merged do go offline at some point. I guess it could be after the merge, too, not before ... maybe others know?

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr Elasticsearch Support * http://sematext.com/

On Mon, Jan 19, 2015 at 4:17 AM, Vladimir Tretyakov vladimir.tretya...@sematext.com wrote:

Hi, I have a question about 'online region merge' (https://issues.apache.org/jira/browse/HBASE-7403). As I understand it, regions passed to the merge method will be unavailable for some time. That means:

1. Some data will be unavailable for some time.
2. If a client tries to write data to these regions it will get exceptions.

Are the above statements correct? Can somebody estimate how long 1 and 2 will be true? Seconds, minutes, or hours? Is there any way to avoid 1 and 2?

I am asking because we currently have a problem over time with the number of regions (our key contains a timestamp): the count of regions grows constantly (through splitting) and eventually becomes a cause of performance problems. To avoid this effect we use two tables:

1. The first table we use for writing and reading data.
2. The second we use only for reading data.

After some time we truncate the second table and rotate the tables (the first becomes the second and the second becomes the first). That allows us to control the count of regions, but the solution looks a bit ugly. I looked at 'online region merge', but we can't live with the restrictions I've described in the first part of the question.

Can somebody help with answers?

Thx, Vladimir Tretyakov.
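[Editor's note: for reference, a sketch of how the online merge added by HBASE-7403 is invoked through the 0.98 admin API. The table name mytable is made up, and picking regions.get(0) and regions.get(1) assumes those two regions are adjacent.]

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class MergeAdjacentRegions {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      List<HRegionInfo> regions = admin.getTableRegions(TableName.valueOf("mytable"));
      // mergeRegions() only dispatches the request; the merge itself
      // completes asynchronously on the region server, during which
      // both regions are briefly taken out of the online list (as the
      // OFFLINED_REGION_A state in the patch above suggests).
      admin.mergeRegions(regions.get(0).getEncodedNameAsBytes(),
                         regions.get(1).getEncodedNameAsBytes(),
                         false); // forcible=false: merge adjacent regions only
    } finally {
      admin.close();
    }
  }
}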