Re: Scan performance

2013-06-21 Thread Anoop John
Have a look at FuzzyRowFilter -Anoop- On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean wrote: > I understand more, but have additional questions about the internals... > > So, in this example I have 6000 rows X 40 columns in this table. In this > test my startRow and stopRow do not narrow the scan c
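
A minimal sketch of the FuzzyRowFilter Anoop points to, assuming (hypothetically) fixed-width 8-byte row keys: a 4-byte fuzzy prefix followed by a 4-byte fixed id. The key, mask, and names are illustrative only; in the 0.94-era API a 0 in the mask marks a fixed byte that must match, a 1 marks a fuzzy byte that may be anything:

    import java.util.Arrays;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.Pair;

    // Match any 4-byte prefix, but require bytes 4-7 to equal "u123".
    byte[] key  = Bytes.add(new byte[4], Bytes.toBytes("u123"));
    byte[] mask = {1, 1, 1, 1, 0, 0, 0, 0};

    Scan scan = new Scan();
    scan.setFilter(new FuzzyRowFilter(
        Arrays.asList(new Pair<byte[], byte[]>(key, mask))));

Unlike a plain start/stop row, this lets the scanner skip ahead inside the key space instead of reading every KV.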

RE: Scan performance

2013-06-21 Thread Tony Dean
I understand more, but have additional questions about the internals... So, in this example I have 6000 rows X 40 columns in this table. In this test my startRow and stopRow do not narrow the scan criteria; therefore all 6000x40 KVs must be included in the search and thus read from disk and int

Re: Logging for MR Job

2013-06-21 Thread Stack
On Fri, Jun 21, 2013 at 4:41 PM, Joel Alexandre wrote: ... > > In my jar there is a log4j.properties file, but it is being ignored. > Your log4j.properties is in the right location inside the job jar? ( http://stackoverflow.com/questions/9081625/override-log4j-properties-in-hadoop ). St.Ack

Re: heap memory running

2013-06-21 Thread Jean-Marc Spaggiari
Also, if you assign "only" 24GB for the heap, the OS will still use some of the remaining memory as cache. And you will need some memory for the Hadoop processes too. JM 2013/6/21 Jean-Daniel Cryans : > 24GB is often cited as an upper limit, but YMMV. > > It also depends if you need memory for MapR

Re: Logging for MR Job

2013-06-21 Thread Jean-Marc Spaggiari
Worst case you can modify the log level directly in your code? JM 2013/6/21 Joel Alexandre : > hi, > > I'm running some HBase MR jobs through the bin/hadoop jar command line. > > How can I change the log level for those specific executions without > changing hbase/conf/log4j.properties ? > > In
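
A minimal sketch of JM's suggestion using the log4j 1.x API. Note the call only affects the JVM it runs in, so for map/reduce tasks it would have to go in the task's setup() rather than in the job driver:

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    // e.g. in the mapper's setup(), or in the driver for client-side logging:
    Logger.getLogger("org.apache.hadoop.hbase").setLevel(Level.DEBUG);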

RE: Scan performance

2013-06-21 Thread Vladimir Rodionov
Lars, I thought that a column family is the locality group, and that placing columns which are frequently accessed together into the same column family (locality group) is the obvious performance improvement tip. What are the "essential column families" for in this context? As for the original question..

Logging for MR Job

2013-06-21 Thread Joel Alexandre
hi, I'm running some HBase MR jobs through the bin/hadoop jar command line. How can I change the log level for those specific executions without changing hbase/conf/log4j.properties? In my jar there is a log4j.properties file, but it is being ignored. Thanks, Joel

Re: Scan performance

2013-06-21 Thread lars hofhansl
HBase is a key value (KV) store. Each column is stored in its own KV; a row is just a set of KVs that happen to have the same row key (which is the first part of the key). I tried to summarize this here: http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html In the StoreFiles all KVs ar
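
To make lars' point concrete, a small illustration with hypothetical names: a "row" with two columns is physically two KVs, each repeating the row key:

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Bytes;

    // Each KV carries the full key: row + family + qualifier + timestamp.
    KeyValue kvA = new KeyValue(Bytes.toBytes("row1"), Bytes.toBytes("cf"),
        Bytes.toBytes("colA"), 1L, Bytes.toBytes("valA"));
    KeyValue kvB = new KeyValue(Bytes.toBytes("row1"), Bytes.toBytes("cf"),
        Bytes.toBytes("colB"), 1L, Bytes.toBytes("valB"));

This is why a 40-column row costs roughly 40x the KVs of a 1-column row in a full scan.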

Re: heap memory running

2013-06-21 Thread Jean-Daniel Cryans
24GB is often cited as an upper limit, but YMMV. It also depends if you need memory for MapReduce, if you are using it. J-D On Wed, Jun 19, 2013 at 3:17 PM, prakash kadel wrote: > hi everyone, > > i am quite new to HBase and Java. I have a few questions. > > 1. on the web UI for HBase I have th
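
For reference, the heap J-D is talking about would typically be set in conf/hbase-env.sh; in 0.94-era HBase the value is, if memory serves, interpreted as MB:

    # conf/hbase-env.sh
    export HBASE_HEAPSIZE=24576   # ~24GB; leave the rest of RAM to the OS cache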

Re: Replication not suited for intensive write applications?

2013-06-21 Thread Jean-Daniel Cryans
I think that, the same way writing with more clients helped throughput, writing with only 1 replication thread will hurt it. The clients in both cases have to read something (a file from HDFS or the WAL) then ship it, meaning that you can utilize the cluster better since a single client isn't consis

Scan performance

2013-06-21 Thread Tony Dean
Hi, I hope that you can shed some light on these 2 scenarios below. I have 2 small tables of 6000 rows. Table 1 has only 1 column in each of its rows. Table 2 has 40 columns in each of its rows. Other than that the two tables are identical. In both tables there is only 1 row that contains a matc
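
For comparison, a sketch of how a scan would be narrowed so fewer KVs are touched (row keys, family, and qualifier names here are hypothetical):

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    Scan scan = new Scan(Bytes.toBytes("startKey"), Bytes.toBytes("stopKey"));
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1")); // only this column's KVs
    scan.setCaching(100); // batch rows per RPC to cut round trips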

Re: Possibility of using timestamp as row key in HBase

2013-06-21 Thread yun peng
Thanks Asaf and Anoop. You are right, data in the Memstore is already sorted, so flush() would not block too much with the current write stream to another Memstore... But wait, flush() consumes disk IO, which I think would interfere with WAL writes. Say we have two Memstores A and B on one node. A is d

HBase Filterlist hierarchy not working

2013-06-21 Thread sandesh
Hi, FilterList with multiple filters is not working for us. We have a table 'country_details' with family 'country' having columns with prefixes 'AGE' and 'SALARY'. Data is inserted as shown below. We need to get the following rows and columns based on filters: 'SRILANKA' if 'AGE' prefix column
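
A sketch of how the combination sandesh describes could be expressed with 0.94-era filter classes (the row and prefix values are from the question; exact semantics should be verified against the actual data):

    import java.util.Arrays;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.*;
    import org.apache.hadoop.hbase.util.Bytes;

    // Columns starting with AGE OR SALARY...
    Filter prefixes = new FilterList(FilterList.Operator.MUST_PASS_ONE,
        Arrays.<Filter>asList(
            new ColumnPrefixFilter(Bytes.toBytes("AGE")),
            new ColumnPrefixFilter(Bytes.toBytes("SALARY"))));
    // ...AND only for the SRILANKA row.
    Filter row = new RowFilter(CompareFilter.CompareOp.EQUAL,
        new BinaryComparator(Bytes.toBytes("SRILANKA")));

    Scan scan = new Scan();
    scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL,
        Arrays.asList(row, prefixes)));

A common pitfall is mixing up MUST_PASS_ONE (OR) and MUST_PASS_ALL (AND) when nesting lists.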

Re: Replication not suited for intensive write applications?

2013-06-21 Thread lars hofhansl
Hmm... Yes. Was worth a try :) Should've checked, and I even wrote that part of the code. I have no good explanation then, and also no good suggestion about how to improve this.

Re: Replication not suited for intensive write applications?

2013-06-21 Thread Asaf Mesika
On Fri, Jun 21, 2013 at 2:38 PM, lars hofhansl wrote: > Another thought... > > I assume you only write to a single table, right? How large are your rows > on average? > > I'm writing to 2 tables: Avg row size for the 1st table is 1500 bytes, and the second is around 800 bytes > > Replication

Re: Replication not suited for intensive write applications?

2013-06-21 Thread lars hofhansl
Another thought... I assume you only write to a single table, right? How large are your rows on average? Replication will send 64MB blocks by default (or 25000 edits, whichever is smaller). The default HTable buffer is 2MB only, so the slave RS receiving a block of edits (assuming it is a full
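
If memory serves, those two bounds map to the source-side settings below in hbase-site.xml; lowering them would be one (hypothetical) way to trade batch size for latency:

    <!-- hbase-site.xml on the source cluster -->
    <property>
      <name>replication.source.size.capacity</name>
      <value>67108864</value>   <!-- default: 64MB per shipment -->
    </property>
    <property>
      <name>replication.source.nb.capacity</name>
      <value>25000</value>      <!-- default: max edits per shipment -->
    </property>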

Re: bulk-load bug ?

2013-06-21 Thread fx_bull
Thanks! The default mapper org.apache.hadoop.hbase.mapreduce.TsvImporterMapper uses the same ts; I can rewrite it to achieve my goal! On 2013-6-21, at 4:44 PM, Anoop John wrote: > the ts for each row in the raw data file.. While > running the tool we can specify which column (in raw data file) should be >

Re: Possibility of using timestamp as row key in HBase

2013-06-21 Thread Anoop John
You can specify a max size to indicate the region split (when a region should get split). But this size is the size of the HFile. To be precise, it is the size of the biggest HFile under that region. If you specify this size as 10G, then when the region has an HFile bigger than 10G the region w
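
For illustration, the per-table variant of that size can be set from the HBase shell (table name hypothetical; the cluster-wide default lives in hbase.hregion.max.filesize):

    hbase> alter 'mytable', MAX_FILESIZE => '10737418240'   # 10G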

Re: Replication not suited for intensive write applications?

2013-06-21 Thread lars hofhansl
Thanks for checking... Interesting. So talking to 3 RSs as opposed to only 1 before had no effect on the throughput? Would be good to explore this a bit more. Since our RPC is not streaming, latency will affect throughput. In this case there is latency while all edits are shipped to the RS in the

Re: bulk-load bug ?

2013-06-21 Thread Anoop John
When adding data to HBase with the same key, it is the timestamp (ts) which determines the version. Different ts values make different versions for the cell. But in case of bulk load using the ImportTsv tool, the ts used by one mapper will be the same. All the Puts created from it will have the same ts. The tool allows us
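
For example, the stock tool can take the ts from the data itself; if memory serves, this is done by marking one input column as HBASE_TS_KEY (the table name and column layout here are hypothetical):

    $ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
        -Dimporttsv.columns=HBASE_ROW_KEY,HBASE_TS_KEY,cf:when \
        mytable /input/data.tsv

With a per-row ts, rows that share a key land as distinct versions instead of collapsing into one.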

bulk-load bug ?

2013-06-21 Thread fx_bull
hello everyone When I use bulk-load to import data into HBase, I found that if I have some rowkeys with the same value, only one of them is imported into HBase! But I want to import all of them into HBase as different versions. How should I do that? Original data mike18:20 mike16:20 mike1