Parse Scan to Base64
I want to serialize a Scan object to Base64. I know there is a method that does this, but it was originally package-scoped, and since I can't update the version of this library I can't use it. I have checked the code, and it seems the method does more than encode to Base64; it does some things beforehand. Is that true? Is it not as easy as using java.util.Base64 or another class to do it myself?
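For what it's worth, if the method in question is TableMapReduceUtil.convertScanToString, then yes, it does more than Base64: it first serializes the Scan to its protobuf representation and then Base64-encodes those bytes, so plain java.util.Base64 over your own serialization would not produce a compatible string. A minimal sketch of the encoding half (the protobuf step is only shown in a comment, since it needs hbase-client on the classpath; the stand-in bytes are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ScanToBase64Sketch {

    // In the real method the input bytes come from the Scan's protobuf form:
    //   byte[] bytes = ProtobufUtil.toScan(scan).toByteArray();
    // which is why Base64 alone is not enough to reproduce the string.
    static String encode(byte[] scanProtoBytes) {
        return Base64.getEncoder().encodeToString(scanProtoBytes);
    }

    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // stand-in bytes; a real Scan's protobuf bytes would go here
        byte[] fake = "scan-proto".getBytes(StandardCharsets.UTF_8);
        String encoded = encode(fake);
        System.out.println(new String(decode(encoded), StandardCharsets.UTF_8));
    }
}
```

Note also that older HBase versions shipped their own Base64 utility class, so if you need byte-for-byte compatible output, check which encoder the library actually uses.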
Trying to execute a put to docker cloudera hbase
I'm trying to insert a record in HBase. I have the Docker Cloudera QuickStart image. From my machine I can connect with telnet to the ports Docker exposes for ZooKeeper and HBase (2181, 6, 60010). I have included the hbase-site, core-site, and hdfs-site files in the classpath. When the put is executed, the program stays blocked. Why? *TableName nameTable = TableName.valueOf(NAMETABLE); Connection conn = ConnectionFactory.createConnection(config); hTable = conn.getTable(nameTable); Put p = new Put(Bytes.toBytes(informations.getString("id"))); p.add(Bytes.toBytes("informations"), Bytes.toBytes("carId"), Bytes.toBytes(carId)); hTable.put(p); --> blocked*
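Not an answer to the blocking itself, but a sketch of the same put with the client's retry/timeout settings lowered, so that a misconfigured connection fails with an exception instead of hanging (these property names are from the standard HBase client; the table name and values here are placeholders, and a 1.x-style client API is assumed):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutWithTimeouts {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        // fail fast instead of hanging if a region server is unreachable
        config.set("hbase.client.retries.number", "3");
        config.set("hbase.rpc.timeout", "10000");              // ms
        config.set("hbase.client.operation.timeout", "30000"); // ms

        try (Connection conn = ConnectionFactory.createConnection(config);
             Table table = conn.getTable(TableName.valueOf("mytable"))) {
            Put p = new Put(Bytes.toBytes("row-1"));
            p.addColumn(Bytes.toBytes("informations"),
                        Bytes.toBytes("carId"),
                        Bytes.toBytes("car-42"));
            table.put(p); // now throws once the retries are exhausted
        }
    }
}
```

Once it throws instead of blocking, the exception usually points at the real cause; with Dockerized quickstart images that is often the region server registering a container hostname in ZooKeeper that the host machine cannot resolve.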
Scan vs TableInputFormat to process data
Just to be sure: if I execute a Scan inside Spark, the execution goes through the RegionServers and I get all the features of an HBase Scan (filters and so on); all the parallelization is in the hands of the RegionServers (even though I'm running the program with Spark). If I use TableInputFormat, I read all the column families (even if I don't want to), with no prior filtering either; it just opens the files of an HBase table and processes them completely. All the parallelization is in Spark, and HBase isn't used at all; it just reads from HDFS the files that HBase stored for a specific table. Am I missing something?
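One nuance worth noting: TableInputFormat can be handed a Scan through the job configuration, so it is possible to restrict column families and apply server-side filters rather than always reading everything. A sketch, with table/family names as placeholders (convertScanToString is public in recent HBase versions; as the first question in this digest notes, it was package-scoped in some older ones):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

public class RestrictedTableInputFormat {
    public static Configuration configure() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "mytable");

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("cf1"));                    // only one column family
        scan.setFilter(new PrefixFilter(Bytes.toBytes("row-"))); // server-side filter
        scan.setCaching(500);

        // the serialized Scan is what TableInputFormat's record reader replays
        conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));
        return conf;
    }
}
```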
Re: Reading the whole table with MapReduce and Spark.
Another small doubt: if I use the class TableInputFormat to read an HBase table, am I going to read the whole table, or is data that hasn't been flushed to store files not going to be read? On Wed, May 29, 2019 at 0:14, Guillermo Ortiz Fernández (< guillermo.ortiz.f...@gmail.com>) wrote: > it depends of the row, they did only share 5% of the qualifiers names. > Each row could have about 500-3000 columns in 3 column families. One of > them has 80% of the columns. > > The table has around 75M of rows. > > On Tue, May 28, 2019 at 17:33, wrote: > >> Guillermo >> >> >> How large is your table? How many columns? >> >> >> Sincerely, >> >> Sean >> >> > On May 28, 2019 at 10:11 AM Guillermo Ortiz > mailto:konstt2...@gmail.com > wrote: >> > >> > >> > I have a doubt. When you process a Hbase table with MapReduce you >> could use >> > the TableInputFormat, I understand that it goes directly to HDFS >> files >> > (storesFiles in HDFS) , so you could do some filter in the map >> phase and >> > it's not the same to go through to the region servers to do some >> massive >> > queriesIt's possible to do the same using TableInputFormat with >> Spark and >> > it's more efficient than use scan with filters and so on (again) >> when you >> > want to do a massive query about all the table. Am I right? >> > >> >
Re: Reading the whole table with MapReduce and Spark.
It depends on the row; they only share 5% of the qualifier names. Each row could have about 500-3000 columns in 3 column families. One of them has 80% of the columns. The table has around 75M rows. On Tue, May 28, 2019 at 17:33, wrote: > Guillermo > > > How large is your table? How many columns? > > > Sincerely, > > Sean > > > On May 28, 2019 at 10:11 AM Guillermo Ortiz mailto:konstt2...@gmail.com > wrote: > > > > > > I have a doubt. When you process a Hbase table with MapReduce you > could use > > the TableInputFormat, I understand that it goes directly to HDFS > files > > (storesFiles in HDFS) , so you could do some filter in the map phase > and > > it's not the same to go through to the region servers to do some > massive > > queriesIt's possible to do the same using TableInputFormat with > Spark and > > it's more efficient than use scan with filters and so on (again) > when you > > want to do a massive query about all the table. Am I right? > > >
Reading the whole table with MapReduce and Spark.
I have a doubt. When you process an HBase table with MapReduce you can use TableInputFormat. I understand that it goes directly to the HDFS files (store files in HDFS), so you can do some filtering in the map phase, and it's not the same as going through the region servers to do massive queries. Is it possible to do the same using TableInputFormat with Spark, and is it more efficient than using a scan with filters and so on (again) when you want to do a massive query over the whole table? Am I right?
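For reference, a sketch of the Spark side using newAPIHadoopRDD with TableInputFormat (names are placeholders). One caveat to the premise above: under the hood TableInputFormat's record reader still issues Scan RPCs to the region servers, one Spark partition per region, rather than opening the HFiles in HDFS directly:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkHBaseRead {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "mytable");

        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("hbase-read").setMaster("local[*]"));

        // one Spark partition per table region; each partition is read by
        // issuing Scan RPCs to that region's server, not by reading HFiles
        JavaPairRDD<ImmutableBytesWritable, Result> rdd =
                sc.newAPIHadoopRDD(conf, TableInputFormat.class,
                                   ImmutableBytesWritable.class, Result.class);

        System.out.println("rows: " + rdd.count());
        sc.stop();
    }
}
```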
Re: table.put(singlePut) vs table.put(ListPut)
I thought that, but then I saw this: https://stackoverflow.com/questions/28754077/is-hbase-batch-put-putlistput-faster-than-putput-what-is-the-capacity-of On Thu, May 9, 2019 at 19:08, Andor Molnar wrote: > Hi Guillermo, > > I'm not sure which version of HBase you're referring to, but it looks like > the API docs are quite clear about these 2 calls: > "Puts some data in the table." vs "Batch puts the specified data into the > table." > > Essentially you can pass multiple Put commands to the second call > (listPuts) and it uses the batch API, so it's probably more efficient for > bulk load scenarios. > > Regards, > Andor > > > > On Thu, May 9, 2019 at 4:05 PM Guillermo Ortiz > wrote: > > > I have seen that there are two methods to make a put, I would like to > know > > if there's any different between call table.put(put) or > table.put(listPuts) > > or under the hood is the same? > > > > My use case is to put many records with spark (kind of bulk load) > > >
Re: Max Number of versions in HBase
OK. On Tue, May 7, 2019 at 20:25, Ankit Singhal () wrote: > HBase versions generally impact the size of the store file (and could > impact performance while scanning), if you are in following limits and know > what you are doing, you should be fine > 1. Single Row size not exceeding your region size > 2. Max versions are in hundreds > 3. And you have HBASE-11544 in your HBase version. > > Regards, > Ankit Singhal > > On Tue, May 7, 2019 at 5:46 AM Guillermo Ortiz Fernández < > guillermo.ortiz.f...@gmail.com> wrote: > > > Hello, > > > > I'm thinking about the design of one hbase table and one approximation > has > > two CF and some of the columns could have hundreds of versions and others > > columns in the same CF only a few values. > > > > If I think how HBase saves data in storefiles it seems that there aren't > > problems about this, but I would like to ask about,, any problem about > save > > hundreds of versions? something to keep in mind? > > >
table.put(singlePut) vs table.put(ListPut)
I have seen that there are two methods for doing a put. I would like to know if there is any difference between calling table.put(put) and table.put(listPuts), or whether they are the same under the hood. My use case is to put many records with Spark (a kind of bulk load).
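For context, the list overload batches: the client groups the Puts by region server and sends them in bulk, so for many records it normally means far fewer RPC round trips than one table.put(p) per record. A sketch (family/qualifier names are placeholders):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPuts {
    // one RPC round trip per record:
    static void putOneByOne(Table table, List<String> ids) throws IOException {
        for (String id : ids) {
            table.put(makePut(id));
        }
    }

    // batched: the client groups these by region server before sending
    static void putBatched(Table table, List<String> ids) throws IOException {
        List<Put> puts = new ArrayList<>();
        for (String id : ids) {
            puts.add(makePut(id));
        }
        table.put(puts);
    }

    static Put makePut(String id) {
        Put p = new Put(Bytes.toBytes(id));
        p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        return p;
    }
}
```

For sustained bulk writes from Spark, BufferedMutator (which buffers and flushes puts asynchronously) may also be worth a look.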
Max Number of versions in HBase
Hello, I'm thinking about the design of an HBase table. One approach has two CFs, where some of the columns could have hundreds of versions and other columns in the same CF only a few values. Considering how HBase saves data in store files, it seems there shouldn't be any problem with this, but I would like to ask: is there any problem with saving hundreds of versions? Anything to keep in mind?
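Since max versions is a per-column-family setting, one option is to give the heavily versioned columns their own family with a high VERSIONS value and leave the rest at the default, which keeps the lightly versioned family's store files small. A sketch with the 0.98-era admin API (table and family names are placeholders):

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

public class VersionedTableDesign {
    public static HTableDescriptor describe() {
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));

        HColumnDescriptor history = new HColumnDescriptor("history");
        history.setMaxVersions(500);   // the heavily versioned columns live here

        HColumnDescriptor latest = new HColumnDescriptor("latest");
        latest.setMaxVersions(1);      // only the newest value is kept

        desc.addFamily(history);
        desc.addFamily(latest);
        return desc;                   // pass to Admin.createTable(desc)
    }
}
```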
Re: Number of tables in HBase.
Sorry, I meant just a few column families, 1 to 3. But I don't know if it's a good idea to have too many tables in HBase. 2016-01-18 11:15 GMT+01:00 Guillermo Ortiz : > Hello, > > I know that a table should have a content number of CFs. How about the > number of tables in the same HBase cluster? it should be okay to have > dozens of tables or it's thought to have just a few number of tables? >
Number of tables in HBase.
Hello, I know that a table should have a limited number of CFs. What about the number of tables in the same HBase cluster? Is it okay to have dozens of tables, or is HBase meant to hold just a small number of tables?
Re: Tool to execute a benchmark for HBase.
Coming back to the benchmark: I executed this command: ycsb run hbase -P workloada -p columnfamily=cf -p operationcount=10 -threads 32 and I got a performance of 2000 ops/sec. What I did later was execute ten of those commands in parallel, and I got about 18000 ops/sec in total. I don't get 2000 ops/sec for each of those executions, but I get about 1800 ops/sec each. I don't know if it's an HBase question, but I don't understand why I get more performance when I execute more commands in parallel if I already execute 32 threads. I took a look at top and saw that with just one process the CPU was working at about 20-60%; when I launch more processes the CPU is at about 400-500%. 2015-01-29 18:23 GMT+01:00 Guillermo Ortiz : > There's an option when you execute yscb to say how many clients > threads you want to use. I tried with 1/8/16/32. Those results are > with 16, the improvement 1vs8 it's pretty high not as much 16 to 32. > I only use one yscb, could it be that important? > > -threads : the number of client threads. By default, the YCSB Client > uses a single worker thread, but additional threads can be specified. > This is often done to increase the amount of load offered against the > database. > > 2015-01-29 17:27 GMT+01:00 Nishanth S : >> How many instances of ycsb do you run and how many threads do you use per >> instance.I guess these ops are per instance and you should get similar >> numbers if you run more instances.In short try running more workload >> instances... >> >> -Nishanth >> >> On Thu, Jan 29, 2015 at 8:49 AM, Guillermo Ortiz >> wrote: >> >>> Yes, I'm using 40%. i can't access to those data either. >>> I don't know how YSCB executes the reads and if they are random and >>> could take advange of the cache. >>> >>> Do you think that it's an acceptable performance? >>> >>> >>> 2015-01-29 16:26 GMT+01:00 Ted Yu : >>> > What's the value for hfile.block.cache.size ? >>> > >>> > By default it is 40%. 
You may want to increase its value if you're using >>> > default. >>> > >>> > Andrew published some ycsb results : >>> > http://people.apache.org/~apurtell/results-ycsb-0.98.8/ycsb >>> > -0.98.0-vs-0.98.8.pdf >>> > >>> > However, I couldn't access the above now. >>> > >>> > Cheers >>> > >>> > On Thu, Jan 29, 2015 at 7:14 AM, Guillermo Ortiz >>> > wrote: >>> > >>> >> Is there any result with that benchmark to compare?? >>> >> I'm executing the different workloads and for example for 100% Reads >>> >> in a table with 10Millions of records I only get an performance of >>> >> 2000operations/sec. I hoped much better performance but I could be >>> >> wrong. I'd like to know if it's a normal performance or I could have >>> >> something bad configured. >>> >> >>> >> >>> >> I have splitted the tabled and all the records are balanced and used >>> >> snappy. >>> >> The cluster has a master and 4 regions servers with 256Gb,Cores 2 (32 >>> >> w/ Hyperthreading), 0.98.6-cdh5.3.0, >>> >> >>> >> RegionServer is executed with these parameters: >>> >> /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver >>> >> -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m >>> >> -Djava.net.preferIPv4Stack=true -Xms640679936 -Xmx640679936 >>> >> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled >>> >> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled >>> >> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh >>> >> -Dhbase.log.dir=/var/log/hbase >>> >> >>> >> >>> -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-cnsalbsrvcl23.lvtc.gsnet.corp.log.out >>> >> >>> -Dhbase.home.dir=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase >>> >> -Dhbase.id.str= -Dhbase.root.logger=INFO,RFA >>> >> >>> >> >>> -Djava.library.path=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/lib/native >>> >> -Dhbase.security.logger=INFO,RFAS >>> >> org.apache.hadoop.hbase.regionserver.HRegionServer start >>> >> >>> >> >>> >> The results for 100% reads are 
>>> >> [OVERALL], RunTime(ms), 42734.0 >>> &
Re: Tool to execute a benchmark for HBase.
There's an option when you execute ycsb to say how many client threads you want to use. I tried with 1/8/16/32. Those results are with 16; the improvement from 1 to 8 is pretty big, not so much from 16 to 32. I only use one ycsb instance; could that be so important? -threads : the number of client threads. By default, the YCSB Client uses a single worker thread, but additional threads can be specified. This is often done to increase the amount of load offered against the database. 2015-01-29 17:27 GMT+01:00 Nishanth S : > How many instances of ycsb do you run and how many threads do you use per > instance.I guess these ops are per instance and you should get similar > numbers if you run more instances.In short try running more workload > instances... > > -Nishanth > > On Thu, Jan 29, 2015 at 8:49 AM, Guillermo Ortiz > wrote: > >> Yes, I'm using 40%. i can't access to those data either. >> I don't know how YSCB executes the reads and if they are random and >> could take advange of the cache. >> >> Do you think that it's an acceptable performance? >> >> >> 2015-01-29 16:26 GMT+01:00 Ted Yu : >> > What's the value for hfile.block.cache.size ? >> > >> > By default it is 40%. You may want to increase its value if you're using >> > default. >> > >> > Andrew published some ycsb results : >> > http://people.apache.org/~apurtell/results-ycsb-0.98.8/ycsb >> > -0.98.0-vs-0.98.8.pdf >> > >> > However, I couldn't access the above now. >> > >> > Cheers >> > >> > On Thu, Jan 29, 2015 at 7:14 AM, Guillermo Ortiz >> > wrote: >> > >> >> Is there any result with that benchmark to compare?? >> >> I'm executing the different workloads and for example for 100% Reads >> >> in a table with 10Millions of records I only get an performance of >> >> 2000operations/sec. I hoped much better performance but I could be >> >> wrong. I'd like to know if it's a normal performance or I could have >> >> something bad configured. 
>> >> >> >> >> >> I have splitted the tabled and all the records are balanced and used >> >> snappy. >> >> The cluster has a master and 4 regions servers with 256Gb,Cores 2 (32 >> >> w/ Hyperthreading), 0.98.6-cdh5.3.0, >> >> >> >> RegionServer is executed with these parameters: >> >> /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver >> >> -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m >> >> -Djava.net.preferIPv4Stack=true -Xms640679936 -Xmx640679936 >> >> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled >> >> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled >> >> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh >> >> -Dhbase.log.dir=/var/log/hbase >> >> >> >> >> -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-cnsalbsrvcl23.lvtc.gsnet.corp.log.out >> >> >> -Dhbase.home.dir=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase >> >> -Dhbase.id.str= -Dhbase.root.logger=INFO,RFA >> >> >> >> >> -Djava.library.path=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/lib/native >> >> -Dhbase.security.logger=INFO,RFAS >> >> org.apache.hadoop.hbase.regionserver.HRegionServer start >> >> >> >> >> >> The results for 100% reads are >> >> [OVERALL], RunTime(ms), 42734.0 >> >> [OVERALL], Throughput(ops/sec), 2340.0570973931763 >> >> [UPDATE], Operations, 1.0 >> >> [UPDATE], AverageLatency(us), 103170.0 >> >> [UPDATE], MinLatency(us), 103168.0 >> >> [UPDATE], MaxLatency(us), 103171.0 >> >> [UPDATE], 95thPercentileLatency(ms), 103.0 >> >> [UPDATE], 99thPercentileLatency(ms), 103.0 >> >> [READ], Operations, 10.0 >> >> [READ], AverageLatency(us), 412.5534 >> >> [READ], AverageLatency(us,corrected), 581.6249026771276 >> >> [READ], MinLatency(us), 218.0 >> >> [READ], MaxLatency(us), 268383.0 >> >> [READ], MaxLatency(us,corrected), 268383.0 >> >> [READ], 95thPercentileLatency(ms), 0.0 >> >> [READ], 95thPercentileLatency(ms,corrected), 0.0 >> >> [READ], 99thPercentileLatency(ms), 0.0 >> >> [READ], 
99thPercentileLatency(ms,corrected), 0.0 >> >> [READ], Return=0, 10 >> >> [CLEANUP], Operations, 1.0 >> >> [CLEANUP], AverageLatency(us), 103598.0 >
Re: Tool to execute a benchmark for HBase.
Yes, I'm using 40%. I can't access those data either. I don't know how YCSB executes the reads, whether they are random, and whether they could take advantage of the cache. Do you think it's acceptable performance? 2015-01-29 16:26 GMT+01:00 Ted Yu : > What's the value for hfile.block.cache.size ? > > By default it is 40%. You may want to increase its value if you're using > default. > > Andrew published some ycsb results : > http://people.apache.org/~apurtell/results-ycsb-0.98.8/ycsb > -0.98.0-vs-0.98.8.pdf > > However, I couldn't access the above now. > > Cheers > > On Thu, Jan 29, 2015 at 7:14 AM, Guillermo Ortiz > wrote: > >> Is there any result with that benchmark to compare?? >> I'm executing the different workloads and for example for 100% Reads >> in a table with 10Millions of records I only get an performance of >> 2000operations/sec. I hoped much better performance but I could be >> wrong. I'd like to know if it's a normal performance or I could have >> something bad configured. >> >> >> I have splitted the tabled and all the records are balanced and used >> snappy. 
>> The cluster has a master and 4 regions servers with 256Gb,Cores 2 (32 >> w/ Hyperthreading), 0.98.6-cdh5.3.0, >> >> RegionServer is executed with these parameters: >> /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver >> -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m >> -Djava.net.preferIPv4Stack=true -Xms640679936 -Xmx640679936 >> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled >> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled >> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh >> -Dhbase.log.dir=/var/log/hbase >> >> -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-cnsalbsrvcl23.lvtc.gsnet.corp.log.out >> -Dhbase.home.dir=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase >> -Dhbase.id.str= -Dhbase.root.logger=INFO,RFA >> >> -Djava.library.path=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/lib/native >> -Dhbase.security.logger=INFO,RFAS >> org.apache.hadoop.hbase.regionserver.HRegionServer start >> >> >> The results for 100% reads are >> [OVERALL], RunTime(ms), 42734.0 >> [OVERALL], Throughput(ops/sec), 2340.0570973931763 >> [UPDATE], Operations, 1.0 >> [UPDATE], AverageLatency(us), 103170.0 >> [UPDATE], MinLatency(us), 103168.0 >> [UPDATE], MaxLatency(us), 103171.0 >> [UPDATE], 95thPercentileLatency(ms), 103.0 >> [UPDATE], 99thPercentileLatency(ms), 103.0 >> [READ], Operations, 10.0 >> [READ], AverageLatency(us), 412.5534 >> [READ], AverageLatency(us,corrected), 581.6249026771276 >> [READ], MinLatency(us), 218.0 >> [READ], MaxLatency(us), 268383.0 >> [READ], MaxLatency(us,corrected), 268383.0 >> [READ], 95thPercentileLatency(ms), 0.0 >> [READ], 95thPercentileLatency(ms,corrected), 0.0 >> [READ], 99thPercentileLatency(ms), 0.0 >> [READ], 99thPercentileLatency(ms,corrected), 0.0 >> [READ], Return=0, 10 >> [CLEANUP], Operations, 1.0 >> [CLEANUP], AverageLatency(us), 103598.0 >> [CLEANUP], MinLatency(us), 103596.0 >> [CLEANUP], MaxLatency(us), 103599.0 >> [CLEANUP], 
95thPercentileLatency(ms), 103.0 >> [CLEANUP], 99thPercentileLatency(ms), 103.0 >> >> hbase(main):030:0> describe 'username' >> DESCRIPTION >> ENABLED >> 'username', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER >> => 'ROW', REPLICATION_SCOPE => '0', true >> VERSIONS => '1', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', TTL >> => 'FOREVER', KEEP_DELETED_CELLS => ' >> false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} >> 1 row(s) in 0.0170 seconds >> >> 2015-01-29 5:27 GMT+01:00 Ted Yu : >> > Maybe ask on Cassandra mailing list for the benchmark tool they use ? >> > >> > Cheers >> > >> > On Wed, Jan 28, 2015 at 1:23 PM, Guillermo Ortiz >> > wrote: >> > >> >> I was checking that web, do you know if there's another possibility >> >> since last updated for Cassandra was two years ago and I'd like to >> >> compare bothof them with kind of same tool/code. >> >> >> >> 2015-01-28 22:10 GMT+01:00 Ted Yu : >> >> > Guillermo: >> >> > If you use hbase 0.98.x, please consid
Re: Tool to execute a benchmark for HBase.
Is there any result from that benchmark to compare against? I'm executing the different workloads and, for example, for 100% reads on a table with 10 million records I only get a performance of 2000 operations/sec. I hoped for much better performance, but I could be wrong. I'd like to know whether this is normal performance or whether I might have something badly configured. I have split the table, all the records are balanced, and I use Snappy. The cluster has a master and 4 region servers with 256GB, 2 CPUs (32 w/ Hyperthreading), 0.98.6-cdh5.3.0. The RegionServer is executed with these parameters: /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m -Djava.net.preferIPv4Stack=true -Xms640679936 -Xmx640679936 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-cnsalbsrvcl23.lvtc.gsnet.corp.log.out -Dhbase.home.dir=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase -Dhbase.id.str= -Dhbase.root.logger=INFO,RFA -Djava.library.path=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/lib/native -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start The results for 100% reads are [OVERALL], RunTime(ms), 42734.0 [OVERALL], Throughput(ops/sec), 2340.0570973931763 [UPDATE], Operations, 1.0 [UPDATE], AverageLatency(us), 103170.0 [UPDATE], MinLatency(us), 103168.0 [UPDATE], MaxLatency(us), 103171.0 [UPDATE], 95thPercentileLatency(ms), 103.0 [UPDATE], 99thPercentileLatency(ms), 103.0 [READ], Operations, 10.0 [READ], AverageLatency(us), 412.5534 [READ], AverageLatency(us,corrected), 581.6249026771276 [READ], MinLatency(us), 218.0 [READ], MaxLatency(us), 268383.0 [READ], MaxLatency(us,corrected), 268383.0 [READ], 95thPercentileLatency(ms), 0.0 [READ], 
95thPercentileLatency(ms,corrected), 0.0 [READ], 99thPercentileLatency(ms), 0.0 [READ], 99thPercentileLatency(ms,corrected), 0.0 [READ], Return=0, 10 [CLEANUP], Operations, 1.0 [CLEANUP], AverageLatency(us), 103598.0 [CLEANUP], MinLatency(us), 103596.0 [CLEANUP], MaxLatency(us), 103599.0 [CLEANUP], 95thPercentileLatency(ms), 103.0 [CLEANUP], 99thPercentileLatency(ms), 103.0 hbase(main):030:0> describe 'username' DESCRIPTION ENABLED 'username', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', true VERSIONS => '1', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => ' false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} 1 row(s) in 0.0170 seconds 2015-01-29 5:27 GMT+01:00 Ted Yu : > Maybe ask on Cassandra mailing list for the benchmark tool they use ? > > Cheers > > On Wed, Jan 28, 2015 at 1:23 PM, Guillermo Ortiz > wrote: > >> I was checking that web, do you know if there's another possibility >> since last updated for Cassandra was two years ago and I'd like to >> compare bothof them with kind of same tool/code. >> >> 2015-01-28 22:10 GMT+01:00 Ted Yu : >> > Guillermo: >> > If you use hbase 0.98.x, please consider Andrew's ycsb repo: >> > >> > https://github.com/apurtell/ycsb/tree/new_hbase_client >> > >> > Cheers >> > >> > On Wed, Jan 28, 2015 at 12:41 PM, Nishanth S >> > wrote: >> > >> >> You can use ycsb for this purpose.See here >> >> >> >> https://github.com/brianfrankcooper/YCSB/wiki/Getting-Started >> >> -Nishanth >> >> >> >> On Wed, Jan 28, 2015 at 1:37 PM, Guillermo Ortiz >> >> wrote: >> >> >> >> > Hi, >> >> > >> >> > I'd like to do some benchmarks fo HBase but I don't know what tool >> >> > could use. I started to make some code but I guess that there're some >> >> > easier. >> >> > >> >> > I've taken a look to JMeter, but I guess that I'd attack directly from >> >> > Java, JMeter looks great but I don't know if it fits well in this >> >> > scenario. 
What tool could I use to take some measures as time to >> >> > response some read and write request, etc. I'd like that to be able to >> >> > make the same benchmarks to Cassandra. >> >> > >> >> >>
Re: Tool to execute a benchmark for HBase.
I was checking that site. Do you know if there's another possibility? The last update for Cassandra was two years ago and I'd like to compare both of them with the same kind of tool/code. 2015-01-28 22:10 GMT+01:00 Ted Yu : > Guillermo: > If you use hbase 0.98.x, please consider Andrew's ycsb repo: > > https://github.com/apurtell/ycsb/tree/new_hbase_client > > Cheers > > On Wed, Jan 28, 2015 at 12:41 PM, Nishanth S > wrote: > >> You can use ycsb for this purpose.See here >> >> https://github.com/brianfrankcooper/YCSB/wiki/Getting-Started >> -Nishanth >> >> On Wed, Jan 28, 2015 at 1:37 PM, Guillermo Ortiz >> wrote: >> >> > Hi, >> > >> > I'd like to do some benchmarks fo HBase but I don't know what tool >> > could use. I started to make some code but I guess that there're some >> > easier. >> > >> > I've taken a look to JMeter, but I guess that I'd attack directly from >> > Java, JMeter looks great but I don't know if it fits well in this >> > scenario. What tool could I use to take some measures as time to >> > response some read and write request, etc. I'd like that to be able to >> > make the same benchmarks to Cassandra. >> > >>
Tool to execute a benchmark for HBase.
Hi, I'd like to do some benchmarks of HBase, but I don't know what tool I could use. I started to write some code myself, but I guess there are easier options. I've taken a look at JMeter; it looks great, but I'd rather drive the test directly from Java and I don't know if JMeter fits well in this scenario. What tool could I use to take measures such as the response time of read and write requests, etc.? I'd like to be able to run the same benchmarks against Cassandra.
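If rolling your own, the core measurement is simple to sketch. A minimal stdlib helper (the timed operation below is a stand-in; real code would wrap an HBase get or put inside the Runnable):

```java
import java.util.concurrent.TimeUnit;

public class LatencyProbe {
    /** Times a single operation and returns its latency in microseconds. */
    public static long timeMicros(Runnable op) {
        long t0 = System.nanoTime();
        op.run();
        return TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - t0);
    }

    public static void main(String[] args) {
        // real code would wrap an HBase call, e.g. () -> table.get(new Get(row))
        long us = timeMicros(() -> {
            for (int i = 0; i < 1_000; i++) { Math.sqrt(i); }
        });
        System.out.println("latency: " + us + " us");
    }
}
```

For anything serious, though, a dedicated load generator handles warm-up, percentiles, and coordinated omission far better than a hand-rolled loop, which is what the replies in this thread point at with YCSB.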
Paint dashboards from HBase
Is there any tool to draw data from HBase on a dashboard, like Kibana? I have been looking, but I haven't found a tool that fits directly with HBase for that purpose.
Hbase with Phoenix, when?
I just read about the Phoenix project. People from Phoenix speak very well of it. The question is: they give you SQL and it's supposedly pretty fast, so are there cases where it's better to use plain HBase without Phoenix? Is Phoenix in general faster than executing native scans and writing your own coprocessors?
Re: Scan vs Parallel scan.
I attach the code that I'm executing. I don't have access to the data generator for HBase. In the last benchmark, a simple scan takes about 4 times less than this version. With this version it's only possible to do complete scans. I have been trying a complete scan of an HTable with 100,000 rows and it takes less than one second; isn't that too fast? 2014-09-14 20:21 GMT+02:00 Guillermo Ortiz : > I don't have the code here. But I created a class RegionScanner, this > class does a complete scan of a region. So I have to set the start and stop > keys. the start and stop key are the limits of that region. > > On Sunday, September 14, 2014, Anoop John > wrote: > > Again full code snippet can better speak. >> >> But not getting what u r doing with below code >> >> private List generatePartitions() { >> List regionScanners = new >> ArrayList(); >> byte[] startKey; >> byte[] stopKey; >> HConnection connection = null; >> HBaseAdmin hbaseAdmin = null; >> try { >> connection = HConnectionManager. >> createConnection(HBaseConfiguration.create()); >> hbaseAdmin = new HBaseAdmin(connection); >> List regions = >> hbaseAdmin.getTableRegions(scanConfiguration.getTable()); >> RegionScanner regionScanner = null; >> for (HRegionInfo region : regions) { >> >> startKey = region.getStartKey(); >> stopKey = region.getEndKey(); >> >> regionScanner = new RegionScanner(startKey, stopKey, >> scanConfiguration); >> // regionScanner = createRegionScanner(startKey, stopKey); >> if (regionScanner != null) { >> regionScanners.add(regionScanner); >> } >> } >> >> And I execute the RegionScanner with this: >> public List call() throws Exception { >> HConnection connection = >> HConnectionManager. 
>> createConnection(HBaseConfiguration.create()); >> HTableInterface table = >> connection.getTable(configuration.getTable()); >> >> Scan scan = new Scan(startKey, stopKey); >> scan.setBatch(configuration.getBatch()); >> scan.setCaching(configuration.getCaching()); >> ResultScanner resultScanner = table.getScanner(scan); >> >> >> What is this part? >> new RegionScanner(startKey, stopKey, >> scanConfiguration); >> >> >> >>Scan scan = new Scan(startKey, stopKey); >> scan.setBatch(configuration. >> getBatch()); >> scan.setCaching(configuration.getCaching()); >> ResultScanner resultScanner = table.getScanner(scan); >> >> >> And not setting start and stop rows to this Scan object? !! >> >> >> Sorry If I missed some parts from ur code. >> >> -Anoop- >> >> >> On Sun, Sep 14, 2014 at 2:54 PM, Guillermo Ortiz >> wrote: >> >> > I don't have the code here,, but I'll put the code in a couple of days. >> I >> > have to check the executeservice again! I don't remember exactly how I >> did. >> > >> > I'm using Hbase 0.98. >> > >> > El domingo, 14 de septiembre de 2014, lars hofhansl >> > escribió: >> > >> > > What specific version of 0.94 are you using? >> > > >> > > In general, if you have multiple spindles (disks) and/or multiple CPU >> > > cores at the region server you should benefits from keeping multiple >> > region >> > > server handler threads busy. I have experimented with this before and >> > saw a >> > > close to linear speed up (up to the point where all disks/core were >> > busy). >> > > Obviously this also assuming this is the only load you throw at the >> > servers >> > > at this point. >> > > >> > > Can you post your complete code to pastebin? Maybe even with some >> code to >> > > seed the data? >> > > How do you run your callables? Did you configure the ExecuteService >> > > correctly (assuming you use one to run your callables)? >> > > >> > > Then we can run it and have a look. >> > > >> > > Thanks. 
>> > > >> > > -- Lars >> > > >> > > >> > > - Original Message - >> > > From: Guillermo Ortiz > >> > > To: "user@hbase.apache.org " > > > &g
Re: Scan vs Parallel scan.
I don't have the code here. But I created a class RegionScanner, this class does a complete scan of a region. So I have to set the start and stop keys. the start and stop key are the limits of that region. El domingo, 14 de septiembre de 2014, Anoop John escribió: > Again full code snippet can better speak. > > But not getting what u r doing with below code > > private List generatePartitions() { > List regionScanners = new > ArrayList(); > byte[] startKey; > byte[] stopKey; > HConnection connection = null; > HBaseAdmin hbaseAdmin = null; > try { > connection = HConnectionManager. > createConnection(HBaseConfiguration.create()); > hbaseAdmin = new HBaseAdmin(connection); > List regions = > hbaseAdmin.getTableRegions(scanConfiguration.getTable()); > RegionScanner regionScanner = null; > for (HRegionInfo region : regions) { > > startKey = region.getStartKey(); > stopKey = region.getEndKey(); > > regionScanner = new RegionScanner(startKey, stopKey, > scanConfiguration); > // regionScanner = createRegionScanner(startKey, stopKey); > if (regionScanner != null) { > regionScanners.add(regionScanner); > } > } > > And I execute the RegionScanner with this: > public List call() throws Exception { > HConnection connection = > HConnectionManager. > createConnection(HBaseConfiguration.create()); > HTableInterface table = > connection.getTable(configuration.getTable()); > > Scan scan = new Scan(startKey, stopKey); > scan.setBatch(configuration.getBatch()); > scan.setCaching(configuration.getCaching()); > ResultScanner resultScanner = table.getScanner(scan); > > > What is this part? > new RegionScanner(startKey, stopKey, > scanConfiguration); > > > >>Scan scan = new Scan(startKey, stopKey); > scan.setBatch(configuration. > getBatch()); > scan.setCaching(configuration.getCaching()); > ResultScanner resultScanner = table.getScanner(scan); > > > And not setting start and stop rows to this Scan object? !! > > > Sorry If I missed some parts from ur code. 
> > -Anoop- > > > On Sun, Sep 14, 2014 at 2:54 PM, Guillermo Ortiz > > wrote: > > > I don't have the code here,, but I'll put the code in a couple of days. I > > have to check the executeservice again! I don't remember exactly how I > did. > > > > I'm using Hbase 0.98. > > > > El domingo, 14 de septiembre de 2014, lars hofhansl > > > escribió: > > > > > What specific version of 0.94 are you using? > > > > > > In general, if you have multiple spindles (disks) and/or multiple CPU > > > cores at the region server you should benefits from keeping multiple > > region > > > server handler threads busy. I have experimented with this before and > > saw a > > > close to linear speed up (up to the point where all disks/core were > > busy). > > > Obviously this also assuming this is the only load you throw at the > > servers > > > at this point. > > > > > > Can you post your complete code to pastebin? Maybe even with some code > to > > > seed the data? > > > How do you run your callables? Did you configure the ExecuteService > > > correctly (assuming you use one to run your callables)? > > > > > > Then we can run it and have a look. > > > > > > Thanks. > > > > > > -- Lars > > > > > > > > > - Original Message - > > > From: Guillermo Ortiz > > > > > To: "user@hbase.apache.org " < > user@hbase.apache.org > > > > > > > Cc: > > > Sent: Saturday, September 13, 2014 4:49 PM > > > Subject: Re: Scan vs Parallel scan. > > > > > > What am I missing?? > > > > > > > > > > > > > > > 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz > > > >: > > > > > > > For an partial scan, I guess that I call to the RS to get data, it > > starts > > > > looking in the store files and recollecting the data. (It doesn't > write > > > to > > > > the blockcache in both cases). It has ready the data and it gives to > > the > > > > client the data step by step, I mean,,, it depends the caching and > > > batching > > > > parameters. > > &
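The key handling in `generatePartitions()` above relies on HBase's region boundary conventions: row keys are compared as unsigned bytes, an empty start key means the beginning of the table, and an empty end key means the end of it. A minimal, self-contained sketch of those semantics (plain Java, no HBase dependency; the comparison mirrors what HBase's `Bytes.compareTo` does, and the byte values are illustrative):

```java
import java.util.*;

public class RegionPartitions {
    // Unsigned lexicographic comparison, like HBase's Bytes.compareTo (assumed semantics).
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // True if row falls in [start, stop); empty arrays mean unbounded, as with region boundaries.
    static boolean inRange(byte[] row, byte[] start, byte[] stop) {
        return (start.length == 0 || compare(row, start) >= 0)
            && (stop.length == 0 || compare(row, stop) < 0);
    }

    public static void main(String[] args) {
        // Three regions: (-inf, "02"), ["02", "07"), ["07", +inf) -- like the thread's 0-2/3-7/7-9 example.
        byte[][] boundaries = { {}, {0x30, 0x32}, {0x30, 0x37}, {} };
        byte[] row = {0x30, 0x35}; // "05"
        int hits = 0;
        for (int i = 0; i + 1 < boundaries.length; i++)
            if (inRange(row, boundaries[i], boundaries[i + 1])) hits++;
        System.out.println(hits); // a row falls in exactly one region, so partitions never overlap
    }
}
```

Because consecutive regions share a boundary key ([start, stop) on one side, [stop, next) on the other), per-region scans built this way cover the table exactly once.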
Re: Scan vs Parallel scan.
I don't have the code here, but I'll put the code up in a couple of days. I have to check the ExecutorService again! I don't remember exactly how I did it. I'm using HBase 0.98. On Sunday, September 14, 2014, lars hofhansl wrote: > What specific version of 0.94 are you using? > > In general, if you have multiple spindles (disks) and/or multiple CPU > cores at the region server you should benefit from keeping multiple region > server handler threads busy. I have experimented with this before and saw a > close to linear speed-up (up to the point where all disks/cores were busy). > Obviously this also assumes this is the only load you throw at the servers > at this point. > > Can you post your complete code to pastebin? Maybe even with some code to > seed the data? > How do you run your callables? Did you configure the ExecutorService > correctly (assuming you use one to run your callables)? > > Then we can run it and have a look. > > Thanks. > > -- Lars > > > - Original Message - > From: Guillermo Ortiz > To: "user@hbase.apache.org" > Cc: > Sent: Saturday, September 13, 2014 4:49 PM > Subject: Re: Scan vs Parallel scan. > > What am I missing?? > > > > > 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz: > > > For a partial scan, I guess I call the RS to get the data; it starts > > looking in the store files and collecting the data. (It doesn't write > to > > the blockcache in either case.) It has the data ready and it gives it to > the > > client step by step; I mean, it depends on the caching and > batching > > parameters.
> > > > 2014-09-12 15:23 GMT+02:00 Michael Segel >: > > > >> It doesn’t matter which RS, but that you have 1 thread for each region. > >> > >> So for each thread, what’s happening. > >> Step by step, what is the code doing. > >> > >> Now you’re comparing this against a single table scan, right? > >> What’s happening in the table scan…? > >> > >> > >> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz > > >> wrote: > >> > >> > Right, My table for example has keys between 0-9. in three regions > >> > 0-2,3-7,7-9 > >> > I lauch three partial scans in parallel. The scans that I'm executing > >> are: > >> > scan(0,2), scan(3,7), scan(7,9). > >> > Each region is if a different RS, so each thread goes to different RS. > >> It's > >> > not exactly like that, but on the benchmark case it's like it's > working. > >> > > >> > Really the code will execute a thread for each Region not for each > >> > RegionServer. But in the test I only have two regions for > regionServer. > >> I > >> > dont' think that's an important point, there're two threads for RS. > >> > > >> > 2014-09-12 14:48 GMT+02:00 Michael Segel >: > >> > > >> >> Ok, lets again take a step back… > >> >> > >> >> So you are comparing your partial scan(s) against a full table scan? > >> >> > >> >> If I understood your question, you launch 3 partial scans where you > set > >> >> the start row and then end row of each scan, right? > >> >> > >> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz > > >> wrote: > >> >> > >> >>> Okay, then, the partial scan doesn't work as I think. > >> >>> How could it exceed the limit of a single region if I calculate the > >> >> limits? > >> >>> > >> >>> > >> >>> The only bad point that I see it's that If a region server has three > >> >>> regions of the same table, I'm executing three partial scans about > >> this > >> >> RS > >> >>> and they could compete for resources (network, etc..) on this node. > >> It'd > >> >> be > >> >>> better to have one thread for RS. 
But, that doesn't answer your > >> >> questions. > >> >>> > >> >>> I keep thinking... > >> >>>
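lars's question about configuring the ExecutorService is the crux: the usual shape is one Callable per region, a fixed pool, and `invokeAll` to collect the Futures. A runnable sketch of that pattern, with a `TreeMap` standing in for the table and hard-coded boundaries standing in for `getTableRegions()` (illustrative only; real code would issue a `Scan` per range, as in the snippets above):

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelScanSketch {
    static List<Integer> parallelScan() throws Exception {
        // Stand-in for the HBase table: 100 sorted rows with two-digit keys.
        NavigableMap<String, Integer> table = new TreeMap<>();
        for (int i = 0; i < 100; i++) table.put(String.format("%02d", i), i);

        // Stand-in for hbaseAdmin.getTableRegions(): [start, end) boundaries, null = open-ended.
        String[][] regions = { {"00", "30"}, {"30", "70"}, {"70", null} };

        ExecutorService pool = Executors.newFixedThreadPool(regions.length); // one thread per region
        List<Callable<List<Integer>>> scans = new ArrayList<>();
        for (String[] r : regions) {
            scans.add(() -> {                       // plays the role of RegionScanner.call()
                SortedMap<String, Integer> slice =
                    r[1] == null ? table.tailMap(r[0]) : table.subMap(r[0], r[1]);
                return new ArrayList<>(slice.values());
            });
        }

        List<Integer> merged = new ArrayList<>();
        for (Future<List<Integer>> f : pool.invokeAll(scans)) merged.addAll(f.get()); // join partial results
        pool.shutdown();
        return merged;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelScan().size()); // the partitions cover the table exactly once
    }
}
```

`invokeAll` blocks until every partial scan finishes, which is the simplest way to verify no range is scanned twice or skipped before worrying about timing.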
Re: Scan vs Parallel scan.
What am I missing?? 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz : > For an partial scan, I guess that I call to the RS to get data, it starts > looking in the store files and recollecting the data. (It doesn't write to > the blockcache in both cases). It has ready the data and it gives to the > client the data step by step, I mean,,, it depends the caching and batching > parameters. > > Big differences that I see... > I'm opening more connections to the Table, one for Region. > > I should check the single table scan, it looks like it does partial scans > sequentially. Since you can see on the HBase Master how the request > increase one after another, not all in the same time. > > 2014-09-12 15:23 GMT+02:00 Michael Segel : > >> It doesn’t matter which RS, but that you have 1 thread for each region. >> >> So for each thread, what’s happening. >> Step by step, what is the code doing. >> >> Now you’re comparing this against a single table scan, right? >> What’s happening in the table scan…? >> >> >> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz >> wrote: >> >> > Right, My table for example has keys between 0-9. in three regions >> > 0-2,3-7,7-9 >> > I lauch three partial scans in parallel. The scans that I'm executing >> are: >> > scan(0,2), scan(3,7), scan(7,9). >> > Each region is if a different RS, so each thread goes to different RS. >> It's >> > not exactly like that, but on the benchmark case it's like it's working. >> > >> > Really the code will execute a thread for each Region not for each >> > RegionServer. But in the test I only have two regions for regionServer. >> I >> > dont' think that's an important point, there're two threads for RS. >> > >> > 2014-09-12 14:48 GMT+02:00 Michael Segel : >> > >> >> Ok, lets again take a step back… >> >> >> >> So you are comparing your partial scan(s) against a full table scan? >> >> >> >> If I understood your question, you launch 3 partial scans where you set >> >> the start row and then end row of each scan, right? 
>> >> >> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz >> wrote: >> >> >> >>> Okay, then, the partial scan doesn't work as I think. >> >>> How could it exceed the limit of a single region if I calculate the >> >> limits? >> >>> >> >>> >> >>> The only bad point that I see it's that If a region server has three >> >>> regions of the same table, I'm executing three partial scans about >> this >> >> RS >> >>> and they could compete for resources (network, etc..) on this node. >> It'd >> >> be >> >>> better to have one thread for RS. But, that doesn't answer your >> >> questions. >> >>> >> >>> I keep thinking... >> >>> >> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel : >> >>> >> >>>> Hi, >> >>>> >> >>>> I wanted to take a step back from the actual code and to stop and >> think >> >>>> about what you are doing and what HBase is doing under the covers. >> >>>> >> >>>> So in your code, you are asking HBase to do 3 separate scans and then >> >> you >> >>>> take the result set back and join it. >> >>>> >> >>>> What does HBase do when it does a range scan? >> >>>> What happens when that range scan exceeds a single region? >> >>>> >> >>>> If you answer those questions… you’ll have your answer. >> >>>> >> >>>> HTH >> >>>> >> >>>> -Mike >> >>>> >> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz >> >> wrote: >> >>>> >> >>>>> It's not all the code, I set things like these as well: >> >>>>> scan.setMaxVersions(); >> >>>>> scan.setCacheBlocks(false); >> >>>>> ... >> >>>>> >> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz : >> >>>>> >> >>>>>> yes, that is. I have changed the HBase version to 0.98 >> >>>>>> >> >>>>>> I got the start and stop keys with this method: >> >
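Since `setCaching` and `setBatch` keep coming up in the thread: caching bounds how many rows come back per RPC, and batch bounds how many cells go into a single Result for wide rows. A rough back-of-the-envelope helper for both (this ignores the client's size-based limits, so treat it as an estimate, not the client's exact behavior):

```java
public class ScanTuning {
    // Rough estimate: one RPC round trip per `caching` rows.
    static long rpcRoundTrips(long rows, int caching) {
        return (rows + caching - 1) / caching; // ceil(rows / caching)
    }

    // With batching, a row of `cols` cells is split into ceil(cols / batch) partial Results.
    static long resultsPerRow(int cols, int batch) {
        return (cols + batch - 1) / batch;
    }

    public static void main(String[] args) {
        System.out.println(rpcRoundTrips(100_000, 1));   // caching 1: one round trip per row
        System.out.println(rpcRoundTrips(100_000, 100)); // caching 100: ~1000 round trips
        System.out.println(resultsPerRow(3000, 500));    // a 3000-column row in 6 partial Results
    }
}
```

The round-trip count is why caching 1 is dramatically slower than caching 100 in the benchmark later in the thread: the dominant cost at low caching is RPC latency, not scanning.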
Re: Scan vs Parallel scan.
For an partial scan, I guess that I call to the RS to get data, it starts looking in the store files and recollecting the data. (It doesn't write to the blockcache in both cases). It has ready the data and it gives to the client the data step by step, I mean,,, it depends the caching and batching parameters. Big differences that I see... I'm opening more connections to the Table, one for Region. I should check the single table scan, it looks like it does partial scans sequentially. Since you can see on the HBase Master how the request increase one after another, not all in the same time. 2014-09-12 15:23 GMT+02:00 Michael Segel : > It doesn’t matter which RS, but that you have 1 thread for each region. > > So for each thread, what’s happening. > Step by step, what is the code doing. > > Now you’re comparing this against a single table scan, right? > What’s happening in the table scan…? > > > On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz wrote: > > > Right, My table for example has keys between 0-9. in three regions > > 0-2,3-7,7-9 > > I lauch three partial scans in parallel. The scans that I'm executing > are: > > scan(0,2), scan(3,7), scan(7,9). > > Each region is if a different RS, so each thread goes to different RS. > It's > > not exactly like that, but on the benchmark case it's like it's working. > > > > Really the code will execute a thread for each Region not for each > > RegionServer. But in the test I only have two regions for regionServer. I > > dont' think that's an important point, there're two threads for RS. > > > > 2014-09-12 14:48 GMT+02:00 Michael Segel : > > > >> Ok, lets again take a step back… > >> > >> So you are comparing your partial scan(s) against a full table scan? > >> > >> If I understood your question, you launch 3 partial scans where you set > >> the start row and then end row of each scan, right? > >> > >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz > wrote: > >> > >>> Okay, then, the partial scan doesn't work as I think. 
> >>> How could it exceed the limit of a single region if I calculate the > >> limits? > >>> > >>> > >>> The only bad point that I see it's that If a region server has three > >>> regions of the same table, I'm executing three partial scans about > this > >> RS > >>> and they could compete for resources (network, etc..) on this node. > It'd > >> be > >>> better to have one thread for RS. But, that doesn't answer your > >> questions. > >>> > >>> I keep thinking... > >>> > >>> 2014-09-12 9:40 GMT+02:00 Michael Segel : > >>> > >>>> Hi, > >>>> > >>>> I wanted to take a step back from the actual code and to stop and > think > >>>> about what you are doing and what HBase is doing under the covers. > >>>> > >>>> So in your code, you are asking HBase to do 3 separate scans and then > >> you > >>>> take the result set back and join it. > >>>> > >>>> What does HBase do when it does a range scan? > >>>> What happens when that range scan exceeds a single region? > >>>> > >>>> If you answer those questions… you’ll have your answer. > >>>> > >>>> HTH > >>>> > >>>> -Mike > >>>> > >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz > >> wrote: > >>>> > >>>>> It's not all the code, I set things like these as well: > >>>>> scan.setMaxVersions(); > >>>>> scan.setCacheBlocks(false); > >>>>> ... > >>>>> > >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz : > >>>>> > >>>>>> yes, that is. I have changed the HBase version to 0.98 > >>>>>> > >>>>>> I got the start and stop keys with this method: > >>>>>> private List generatePartitions() { > >>>>>> List regionScanners = new > >>>>>> ArrayList(); > >>>>>> byte[] startKey; > >>>>>> byte[] stopKey; > >>>>>> HConnection connection = null; > >>>>>> HBaseAdmin hbaseAdmin = null; > >>>>>> try { > >>>>>> connection = HConnectionManager. > >>
Re: Scan vs Parallel scan.
Right. My table, for example, has keys between 0-9 in three regions: 0-2, 3-7, 7-9. I launch three partial scans in parallel. The scans that I'm executing are: scan(0,2), scan(3,7), scan(7,9). Each region is in a different RS, so each thread goes to a different RS. It's not exactly like that, but in the benchmark case that's how it works. Really, the code will execute a thread for each Region, not for each RegionServer. But in the test I only have two regions per RegionServer. I don't think that's an important point; there are two threads per RS. 2014-09-12 14:48 GMT+02:00 Michael Segel : > Ok, lets again take a step back… > > So you are comparing your partial scan(s) against a full table scan? > > If I understood your question, you launch 3 partial scans where you set > the start row and then end row of each scan, right? > > On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz wrote: > > > Okay, then, the partial scan doesn't work as I thought. > > How could it exceed the limit of a single region if I calculate the > limits? > > > > > > The only bad point that I see is that if a region server has three > > regions of the same table, I'm executing three partial scans against this > RS > > and they could compete for resources (network, etc.) on this node. It'd > be > > better to have one thread per RS. But that doesn't answer your > questions. > > > > I keep thinking... > > > > 2014-09-12 9:40 GMT+02:00 Michael Segel : > > > >> Hi, > >> > >> I wanted to take a step back from the actual code and to stop and think > >> about what you are doing and what HBase is doing under the covers. > >> > >> So in your code, you are asking HBase to do 3 separate scans and then > you > >> take the result set back and join it. > >> > >> What does HBase do when it does a range scan? > >> What happens when that range scan exceeds a single region? > >> > >> If you answer those questions… you'll have your answer.
> >> > >> HTH > >> > >> -Mike > >> > >> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz > wrote: > >> > >>> It's not all the code, I set things like these as well: > >>> scan.setMaxVersions(); > >>> scan.setCacheBlocks(false); > >>> ... > >>> > >>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz : > >>> > >>>> yes, that is. I have changed the HBase version to 0.98 > >>>> > >>>> I got the start and stop keys with this method: > >>>> private List generatePartitions() { > >>>> List regionScanners = new > >>>> ArrayList(); > >>>> byte[] startKey; > >>>> byte[] stopKey; > >>>> HConnection connection = null; > >>>> HBaseAdmin hbaseAdmin = null; > >>>> try { > >>>> connection = HConnectionManager. > >>>> createConnection(HBaseConfiguration.create()); > >>>> hbaseAdmin = new HBaseAdmin(connection); > >>>> List regions = > >>>> hbaseAdmin.getTableRegions(scanConfiguration.getTable()); > >>>> RegionScanner regionScanner = null; > >>>> for (HRegionInfo region : regions) { > >>>> > >>>> startKey = region.getStartKey(); > >>>> stopKey = region.getEndKey(); > >>>> > >>>> regionScanner = new RegionScanner(startKey, stopKey, > >>>> scanConfiguration); > >>>> // regionScanner = createRegionScanner(startKey, > >> stopKey); > >>>> if (regionScanner != null) { > >>>> regionScanners.add(regionScanner); > >>>> } > >>>> } > >>>> > >>>> And I execute the RegionScanner with this: > >>>> public List call() throws Exception { > >>>> HConnection connection = > >>>> HConnectionManager.createConnection(HBaseConfiguration.create()); > >>>> HTableInterface table = > >>>> connection.getTable(configuration.getTable()); > >>>> > >>>> Scan scan = new Scan(startKey, stopKey); > >>>> scan.setBatch(configuration.getBatch()); > >>>> scan.setCaching(configuration.getCaching()); > >>>> ResultScanner resu
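Mike's question about a range scan exceeding a single region points at the answer: the client exhausts one region's scanner and then opens a scanner on the next region, so a single full-table Scan visits the regions one after another, which matches the observation of requests "increasing one after another" on the Master. A toy model of that sequential behavior (a `TreeMap` in place of the table; the boundaries mirror the 0-2/3-7/7-9 example):

```java
import java.util.*;

public class SequentialRegionScan {
    static List<Integer> scanAll() {
        NavigableMap<String, Integer> table = new TreeMap<>();
        for (int i = 0; i < 100; i++) table.put(String.format("%02d", i), i);
        String[] bounds = {"00", "30", "70", null}; // three consecutive regions, null = open-ended

        List<Integer> rows = new ArrayList<>();
        // One logical Scan: when a region is exhausted, the client moves on to the
        // next one -- the regions are read sequentially, not in parallel.
        for (int r = 0; r + 1 < bounds.length; r++) {
            SortedMap<String, Integer> region = bounds[r + 1] == null
                ? table.tailMap(bounds[r]) : table.subMap(bounds[r], bounds[r + 1]);
            rows.addAll(region.values());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Integer> rows = scanAll();
        System.out.println(rows.size() + " rows, in key order"); // same rows a global scan would return
    }
}
```

Because the single scan already produces the rows in global key order, splitting it per region can only win if the per-thread setup cost is small compared to the scanning itself.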
Re: Scan vs Parallel scan.
Okay, then, the partial scan doesn't work as I think. How could it exceed the limit of a single region if I calculate the limits? The only bad point that I see it's that If a region server has three regions of the same table, I'm executing three partial scans about this RS and they could compete for resources (network, etc..) on this node. It'd be better to have one thread for RS. But, that doesn't answer your questions. I keep thinking... 2014-09-12 9:40 GMT+02:00 Michael Segel : > Hi, > > I wanted to take a step back from the actual code and to stop and think > about what you are doing and what HBase is doing under the covers. > > So in your code, you are asking HBase to do 3 separate scans and then you > take the result set back and join it. > > What does HBase do when it does a range scan? > What happens when that range scan exceeds a single region? > > If you answer those questions… you’ll have your answer. > > HTH > > -Mike > > On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz wrote: > > > It's not all the code, I set things like these as well: > > scan.setMaxVersions(); > > scan.setCacheBlocks(false); > > ... > > > > 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz : > > > >> yes, that is. I have changed the HBase version to 0.98 > >> > >> I got the start and stop keys with this method: > >> private List generatePartitions() { > >>List regionScanners = new > >> ArrayList(); > >>byte[] startKey; > >>byte[] stopKey; > >>HConnection connection = null; > >>HBaseAdmin hbaseAdmin = null; > >>try { > >>connection = HConnectionManager. 
> >> createConnection(HBaseConfiguration.create()); > >>hbaseAdmin = new HBaseAdmin(connection); > >>List regions = > >> hbaseAdmin.getTableRegions(scanConfiguration.getTable()); > >>RegionScanner regionScanner = null; > >>for (HRegionInfo region : regions) { > >> > >>startKey = region.getStartKey(); > >>stopKey = region.getEndKey(); > >> > >>regionScanner = new RegionScanner(startKey, stopKey, > >> scanConfiguration); > >>// regionScanner = createRegionScanner(startKey, > stopKey); > >>if (regionScanner != null) { > >>regionScanners.add(regionScanner); > >>} > >>} > >> > >> And I execute the RegionScanner with this: > >> public List call() throws Exception { > >>HConnection connection = > >> HConnectionManager.createConnection(HBaseConfiguration.create()); > >>HTableInterface table = > >> connection.getTable(configuration.getTable()); > >> > >>Scan scan = new Scan(startKey, stopKey); > >>scan.setBatch(configuration.getBatch()); > >>scan.setCaching(configuration.getCaching()); > >>ResultScanner resultScanner = table.getScanner(scan); > >> > >>List results = new ArrayList(); > >>for (Result result : resultScanner) { > >>results.add(result); > >>} > >> > >>connection.close(); > >>table.close(); > >> > >>return results; > >>} > >> > >> They implement Callable. > >> > >> > >> 2014-09-12 9:26 GMT+02:00 Michael Segel : > >> > >>> Lets take a step back…. > >>> > >>> Your parallel scan is having the client create N threads where in each > >>> thread, you’re doing a partial scan of the table where each partial > scan > >>> takes the first and last row of each region? > >>> > >>> Is that correct? > >>> > >>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz > >>> wrote: > >>> > >>>> I was checking a little bit more about,, I checked the cluster and > data > >>> is > >>>> store in three different regions servers, each one in a differente > node. > >>>> So, I guess the threads go to different hard-disks. > >>>> > >>>> If someone has an idea or suggestion.. 
why it's faster a single scan > >>> than > >>>> this implementation. I based on this implementation > >>>> https://github.com/zygm0nt/hbase-distributed-search > >>>> > >
Re: Scan vs Parallel scan.
It's not all the code, I set things like these as well: scan.setMaxVersions(); scan.setCacheBlocks(false); ... 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz : > yes, that is. I have changed the HBase version to 0.98 > > I got the start and stop keys with this method: > private List generatePartitions() { > List regionScanners = new > ArrayList(); > byte[] startKey; > byte[] stopKey; > HConnection connection = null; > HBaseAdmin hbaseAdmin = null; > try { > connection = HConnectionManager. > createConnection(HBaseConfiguration.create()); > hbaseAdmin = new HBaseAdmin(connection); > List regions = > hbaseAdmin.getTableRegions(scanConfiguration.getTable()); > RegionScanner regionScanner = null; > for (HRegionInfo region : regions) { > > startKey = region.getStartKey(); > stopKey = region.getEndKey(); > > regionScanner = new RegionScanner(startKey, stopKey, > scanConfiguration); > // regionScanner = createRegionScanner(startKey, stopKey); > if (regionScanner != null) { > regionScanners.add(regionScanner); > } > } > > And I execute the RegionScanner with this: > public List call() throws Exception { > HConnection connection = > HConnectionManager.createConnection(HBaseConfiguration.create()); > HTableInterface table = > connection.getTable(configuration.getTable()); > > Scan scan = new Scan(startKey, stopKey); > scan.setBatch(configuration.getBatch()); > scan.setCaching(configuration.getCaching()); > ResultScanner resultScanner = table.getScanner(scan); > > List results = new ArrayList(); > for (Result result : resultScanner) { > results.add(result); > } > > connection.close(); > table.close(); > > return results; > } > > They implement Callable. > > > 2014-09-12 9:26 GMT+02:00 Michael Segel : > >> Lets take a step back…. >> >> Your parallel scan is having the client create N threads where in each >> thread, you’re doing a partial scan of the table where each partial scan >> takes the first and last row of each region? >> >> Is that correct? 
>> >> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz >> wrote: >> >> > I was checking a little bit more about,, I checked the cluster and data >> is >> > store in three different regions servers, each one in a differente node. >> > So, I guess the threads go to different hard-disks. >> > >> > If someone has an idea or suggestion.. why it's faster a single scan >> than >> > this implementation. I based on this implementation >> > https://github.com/zygm0nt/hbase-distributed-search >> > >> > 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz : >> > >> >> I'm working with HBase 0.94 for this case,, I'll try with 0.98, >> although >> >> there is not difference. >> >> I disabled the table and disabled the blockcache for that family and I >> put >> >> scan.setBlockcache(false) as well for both cases. >> >> >> >> I think that it's not possible that I executing an complete scan for >> each >> >> thread since my data are the type: >> >> 01 f:q value=1 >> >> 02 f:q value=2 >> >> 03 f:q value=3 >> >> ... >> >> >> >> I add all the values and get the same result on a single scan than a >> >> distributed, so, I guess that DistributedScan did well. >> >> The count from the hbase shell takes about 10-15seconds, I don't >> remember, >> >> but like 4x of the scan time. >> >> I'm not using any filter for the scans. >> >> >> >> This is the way I calculate number of regions/scans >> >> private List generatePartitions() { >> >>List regionScanners = new >> >> ArrayList(); >> >>byte[] startKey; >> >>byte[] stopKey; >> >>HConnection connection = null; >> >>HBaseAdmin hbaseAdmin = null; >> >>try { >> >>connection = >> >> HConnectionManager.createConnection(HBaseConfiguration.create()); >> >>hbaseAdmin = new HBaseAdmin(connection); >> >>List regions = >> >> hbaseAdmin.getTableRegions(scanConfiguration.get
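One thing the `call()` implementation above does is buffer every Result of a region in a List before returning, so nothing downstream starts until a whole region has been read, and a large region must fit in client memory. An alternative pattern streams rows through a bounded BlockingQueue while the scanners are still running. A sketch with strings standing in for Results and counters standing in for real processing (`POISON` is an illustrative end-of-stream marker, not an HBase construct):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class StreamingScan {
    static final String POISON = "__end_of_region__"; // illustrative end-of-stream marker

    static int run() throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100); // bounded: back-pressure on scanners
        int producers = 3;                                            // one per region, as in the thread
        ExecutorService pool = Executors.newFixedThreadPool(producers + 1);

        for (int p = 0; p < producers; p++) {
            final int id = p;
            pool.submit(() -> {          // plays the role of one region scanner
                for (int i = 0; i < 10; i++) queue.put(id + ":" + i);
                queue.put(POISON);       // signal that this region is exhausted
                return null;
            });
        }

        AtomicInteger consumed = new AtomicInteger();
        Future<?> consumer = pool.submit(() -> {
            int finished = 0;
            while (finished < producers) {
                String row = queue.take();          // rows arrive while scans are still running
                if (row.equals(POISON)) finished++;
                else consumed.incrementAndGet();    // process each row immediately
            }
            return null;
        });
        consumer.get();
        pool.shutdown();
        return consumed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // 3 producers x 10 rows each
    }
}
```

The bounded queue also caps client memory: scanners block once the consumer falls 100 rows behind, instead of accumulating whole regions in lists.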
Re: Scan vs Parallel scan.
yes, that is. I have changed the HBase version to 0.98 I got the start and stop keys with this method: private List generatePartitions() { List regionScanners = new ArrayList(); byte[] startKey; byte[] stopKey; HConnection connection = null; HBaseAdmin hbaseAdmin = null; try { connection = HConnectionManager. createConnection(HBaseConfiguration.create()); hbaseAdmin = new HBaseAdmin(connection); List regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable()); RegionScanner regionScanner = null; for (HRegionInfo region : regions) { startKey = region.getStartKey(); stopKey = region.getEndKey(); regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration); // regionScanner = createRegionScanner(startKey, stopKey); if (regionScanner != null) { regionScanners.add(regionScanner); } } And I execute the RegionScanner with this: public List call() throws Exception { HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create()); HTableInterface table = connection.getTable(configuration.getTable()); Scan scan = new Scan(startKey, stopKey); scan.setBatch(configuration.getBatch()); scan.setCaching(configuration.getCaching()); ResultScanner resultScanner = table.getScanner(scan); List results = new ArrayList(); for (Result result : resultScanner) { results.add(result); } connection.close(); table.close(); return results; } They implement Callable. 2014-09-12 9:26 GMT+02:00 Michael Segel : > Lets take a step back…. > > Your parallel scan is having the client create N threads where in each > thread, you’re doing a partial scan of the table where each partial scan > takes the first and last row of each region? > > Is that correct? > > On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz wrote: > > > I was checking a little bit more about,, I checked the cluster and data > is > > store in three different regions servers, each one in a differente node. > > So, I guess the threads go to different hard-disks. 
> > > > If someone has an idea or suggestion.. why it's faster a single scan than > > this implementation. I based on this implementation > > https://github.com/zygm0nt/hbase-distributed-search > > > > 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz : > > > >> I'm working with HBase 0.94 for this case,, I'll try with 0.98, although > >> there is not difference. > >> I disabled the table and disabled the blockcache for that family and I > put > >> scan.setBlockcache(false) as well for both cases. > >> > >> I think that it's not possible that I executing an complete scan for > each > >> thread since my data are the type: > >> 01 f:q value=1 > >> 02 f:q value=2 > >> 03 f:q value=3 > >> ... > >> > >> I add all the values and get the same result on a single scan than a > >> distributed, so, I guess that DistributedScan did well. > >> The count from the hbase shell takes about 10-15seconds, I don't > remember, > >> but like 4x of the scan time. > >> I'm not using any filter for the scans. > >> > >> This is the way I calculate number of regions/scans > >> private List generatePartitions() { > >>List regionScanners = new > >> ArrayList(); > >>byte[] startKey; > >>byte[] stopKey; > >>HConnection connection = null; > >>HBaseAdmin hbaseAdmin = null; > >>try { > >>connection = > >> HConnectionManager.createConnection(HBaseConfiguration.create()); > >>hbaseAdmin = new HBaseAdmin(connection); > >>List regions = > >> hbaseAdmin.getTableRegions(scanConfiguration.getTable()); > >>RegionScanner regionScanner = null; > >>for (HRegionInfo region : regions) { > >> > >>startKey = region.getStartKey(); > >>stopKey = region.getEndKey(); > >> > >>regionScanner = new RegionScanner(startKey, stopKey, > >> scanConfiguration); > >>// regionScanner = createRegionScanner(startKey, > stopKey); > >>if (regionScanner != null) { > >>regionScanners.add(regionScanner); > >>} > >>} > >> >
Re: Scan vs Parallel scan.
I was checking a little bit more; I checked the cluster and the data is stored in three different region servers, each one in a different node. So, I guess the threads go to different hard disks. If someone has an idea or suggestion... why is a single scan faster than this implementation? I based it on this implementation: https://github.com/zygm0nt/hbase-distributed-search 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz : > I'm working with HBase 0.94 for this case; I'll try with 0.98, although > there is no difference. > I disabled the table and disabled the blockcache for that family, and I set > scan.setCacheBlocks(false) as well in both cases. > > I don't think it's possible that I'm executing a complete scan in each > thread, since my data is of this type: > 01 f:q value=1 > 02 f:q value=2 > 03 f:q value=3 > ... > > I add up all the values and get the same result with a single scan as with a > distributed one, so I guess the DistributedScan did well. > The count from the hbase shell takes about 10-15 seconds, I don't remember > exactly, but something like 4x the scan time. > I'm not using any filter for the scans.
> > This is the way I calculate number of regions/scans > private List generatePartitions() { > List regionScanners = new > ArrayList(); > byte[] startKey; > byte[] stopKey; > HConnection connection = null; > HBaseAdmin hbaseAdmin = null; > try { > connection = > HConnectionManager.createConnection(HBaseConfiguration.create()); > hbaseAdmin = new HBaseAdmin(connection); > List regions = > hbaseAdmin.getTableRegions(scanConfiguration.getTable()); > RegionScanner regionScanner = null; > for (HRegionInfo region : regions) { > > startKey = region.getStartKey(); > stopKey = region.getEndKey(); > > regionScanner = new RegionScanner(startKey, stopKey, > scanConfiguration); > // regionScanner = createRegionScanner(startKey, stopKey); > if (regionScanner != null) { > regionScanners.add(regionScanner); > } > } > > I did some test for a tiny table and I think that the range for each scan > works fine. Although, I though that it was interesting that the time when I > execute distributed scan is about 6x. > > I'm going to check about the hard disks, but I think that ti's right. > > > > > 2014-09-11 7:50 GMT+02:00 lars hofhansl : > >> Which version of HBase? >> Can you show us the code? >> >> >> Your parallel scan with caching 100 takes about 6x as long as the single >> scan, which is suspicious because you say you have 6 regions. >> Are you sure you're not accidentally scanning all the data in each of >> your parallel scans? >> >> -- Lars >> >> >> >> >> From: Guillermo Ortiz >> To: "user@hbase.apache.org" >> Sent: Wednesday, September 10, 2014 1:40 AM >> Subject: Scan vs Parallel scan. >> >> >> Hi, >> >> I developed an distributed scan, I create an thread for each region. After >> that, I've tried to get some times Scan vs DistributedScan. >> I have disabled blockcache in my table. My cluster has 3 region servers >> with 2 regions each one, in total there are 100.000 rows and execute a >> complete scan. 
>> >> My partitions are >> -01666 -> request 16665 >> 01-02 -> request 1 >> 02-049998 -> request 1 >> 049998-04 -> request 1 >> 04-083330 -> request 1 >> 083330- -> request 16671 >> >> >> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 10 >> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> >> Caching 10 >> >> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 10 >> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALJEL:16598ms,Counter:2 -> >> Caching 100 >> >> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 10 >> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> >> Caching 1000 >> >> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 10 >> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> >> Caching 1 >> >> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 10 >> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> >> Caching 100 >> >> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 10 >> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> >> Caching 1000 >> >> Parallel scan works much worse than simple scan,, and I don't know why >> it's >> so fast,, it's really much faster than execute an "count" from hbase >> shell, >> what it doesn't look pretty notmal. The only time that it works better >> parallel is when I execute a normal scan with caching 1. >> >> Any clue about it? >> > >
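Taking the caching-100 numbers above at face value (16598 ms parallel vs 2646 ms single, over 6 regions): if the scanning itself parallelized perfectly, each thread's share of the work would be only ~441 ms, so nearly all of the parallel time must be per-thread fixed cost, which is consistent with each callable creating its own HConnection and doing its own region lookups. A quick sanity check of that arithmetic (a simplified model assuming evenly split work, not a measurement):

```java
public class OverheadEstimate {
    // If the work splits evenly across threads, anything beyond workMs/threads is per-thread fixed cost.
    static double perThreadOverheadMs(double parallelMs, double singleMs, int threads) {
        return parallelMs - singleMs / threads;
    }

    public static void main(String[] args) {
        // Benchmark figures from the thread: caching 100, 6 regions.
        double overhead = perThreadOverheadMs(16598, 2646, 6);
        System.out.printf("~%.0f ms of each parallel scan is fixed overhead%n", overhead);
    }
}
```

Under this model, the obvious fix to try is sharing one connection and one region-list lookup across all the callables, so each thread pays only the scan itself.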
Re: Scan vs Parallel scan.
I'm working with HBase 0.94 for this case; I'll try with 0.98, although I don't expect any difference. I disabled the table, disabled the blockcache for that family, and set scan.setCacheBlocks(false) as well in both cases.

I don't think each thread can be executing a complete scan, since my data look like:

01 f:q value=1
02 f:q value=2
03 f:q value=3
...

I add up all the values and get the same result from the single scan as from the distributed one, so I guess the distributed scan works correctly. The count from the hbase shell takes about 10-15 seconds, I don't remember exactly, but around 4x the scan time. I'm not using any filter in the scans.

This is the way I calculate the number of regions/scans:

    private List<RegionScanner> generatePartitions() {
        List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
        byte[] startKey;
        byte[] stopKey;
        HConnection connection = null;
        HBaseAdmin hbaseAdmin = null;
        try {
            connection = HConnectionManager.createConnection(HBaseConfiguration.create());
            hbaseAdmin = new HBaseAdmin(connection);
            List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
            RegionScanner regionScanner = null;
            for (HRegionInfo region : regions) {
                startKey = region.getStartKey();
                stopKey = region.getEndKey();
                regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
                // regionScanner = createRegionScanner(startKey, stopKey);
                if (regionScanner != null) {
                    regionScanners.add(regionScanner);
                }
            }
            ...

I did some tests on a tiny table and the range for each scan looks right. Still, I found it interesting that the distributed scan takes about 6x the time.

I'm going to check the hard disks, but I think they're fine.

2014-09-11 7:50 GMT+02:00 lars hofhansl :
> Which version of HBase?
> Can you show us the code?
>
> Your parallel scan with caching 100 takes about 6x as long as the single
> scan, which is suspicious because you say you have 6 regions.
> Are you sure you're not accidentally scanning all the data in each of
> your parallel scans?
> > -- Lars > > > > > From: Guillermo Ortiz > To: "user@hbase.apache.org" > Sent: Wednesday, September 10, 2014 1:40 AM > Subject: Scan vs Parallel scan. > > > Hi, > > I developed an distributed scan, I create an thread for each region. After > that, I've tried to get some times Scan vs DistributedScan. > I have disabled blockcache in my table. My cluster has 3 region servers > with 2 regions each one, in total there are 100.000 rows and execute a > complete scan. > > My partitions are > -01666 -> request 16665 > 01-02 -> request 1 > 02-049998 -> request 1 > 049998-04 -> request 1 > 04-083330 -> request 1 > 083330- -> request 16671 > > > 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 10 > 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> > Caching 10 > > 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 10 > 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALJEL:16598ms,Counter:2 -> > Caching 100 > > 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 10 > 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> > Caching 1000 > > 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 10 > 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> > Caching 1 > > 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 10 > 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> > Caching 100 > > 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 10 > 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> > Caching 1000 > > Parallel scan works much worse than simple scan,, and I don't know why it's > so fast,, it's really much faster than execute an "count" from hbase shell, > what it doesn't look pretty notmal. The only time that it works better > parallel is when I execute a normal scan with caching 1. > > Any clue about it? >
Re: Scan vs Parallel scan.
What I meant is that I don't understand why a count takes more time than a complete scan without cache; I would expect scanning the whole table to take longer than a count. The other point is why a distributed scan is slower than a sequential one.

Tomorrow I'll check how many disks we have.

On Wednesday, September 10, 2014, Esteban Gutierrez <este...@cloudera.com> wrote:
> Hello Guillermo,
>
> Sounds like some potential contention going on, how many disks per node you
> have?
>
> Can you explain further what do you mean by "and I don't know why it's so
> fast,, it's really much faster than execute an "count" from hbase shell,"
> the count command from the shell uses the FirstKeyOnlyFilter and a caching
> of 10 which should be close to the behavior of your testing tool if its
> using the same filter and the same cache settings.
>
> cheers,
> esteban.
>
> --
> Cloudera, Inc.
>
> On Wed, Sep 10, 2014 at 1:40 AM, Guillermo Ortiz wrote:
>
> > Hi,
> >
> > I developed an distributed scan, I create an thread for each region. After
> > that, I've tried to get some times Scan vs DistributedScan.
> > I have disabled blockcache in my table. My cluster has 3 region servers
> > with 2 regions each one, in total there are 100.000 rows and execute a
> > complete scan.
> > > > My partitions are > > -01666 -> request 16665 > > 01-02 -> request 1 > > 02-049998 -> request 1 > > 049998-04 -> request 1 > > 04-083330 -> request 1 > > 083330- -> request 16671 > > > > > > 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 10 > > 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> > > Caching 10 > > > > 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 10 > > 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALJEL:16598ms,Counter:2 -> > > Caching 100 > > > > 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 10 > > 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> > > Caching 1000 > > > > 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 10 > > 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> > > Caching 1 > > > > 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 10 > > 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> > > Caching 100 > > > > 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 10 > > 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> > > Caching 1000 > > > > Parallel scan works much worse than simple scan,, and I don't know why > it's > > so fast,, it's really much faster than execute an "count" from hbase > shell, > > what it doesn't look pretty notmal. The only time that it works better > > parallel is when I execute a normal scan with caching 1. > > > > Any clue about it? > > >
Scan vs Parallel scan.
Hi,

I developed a distributed scan: I create a thread for each region. After that, I tried to compare times, Scan vs DistributedScan. I have disabled the blockcache in my table. My cluster has 3 region servers with 2 regions each one; in total there are 100.000 rows and I execute a complete scan.

My partitions are
-01666 -> request 16665
01-02 -> request 1
02-049998 -> request 1
049998-04 -> request 1
04-083330 -> request 1
083330- -> request 16671

14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10

14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100

14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000

14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1

14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100

14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000

The parallel scan performs much worse than the simple scan, and I don't know why the simple scan is so fast; it's much faster than executing a "count" from the hbase shell, which doesn't look normal. The only case where the parallel version wins is against a normal scan with caching 1.

Any clue about it?
How to know regions in a RegionServer?
How can I find out, with Java, which regions are served by each RegionServer? I want to execute a parallel scan with one thread per RegionServer, because I think that's better than one thread per region. Or is it not?
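For what it's worth, in the 0.94-era client the region-to-server mapping can be read from HTable#getRegionLocations() (a NavigableMap of HRegionInfo to ServerName), and the per-server grouping is plain Java. A minimal sketch of the grouping step, modeled on plain strings so it needs no HBase jars; the HBase lookup itself is left as a comment because the exact API varies by version:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RegionsByServer {

    // In HBase 0.94 the mapping can be obtained with something like:
    //   NavigableMap<HRegionInfo, ServerName> locs = hTable.getRegionLocations();
    // Here it is modeled as Map<regionName, serverName> so the grouping
    // logic stays library-free.
    public static Map<String, List<String>> group(Map<String, String> regionToServer) {
        Map<String, List<String>> byServer = new LinkedHashMap<String, List<String>>();
        for (Map.Entry<String, String> e : regionToServer.entrySet()) {
            List<String> regions = byServer.get(e.getValue());
            if (regions == null) {
                regions = new ArrayList<String>();
                byServer.put(e.getValue(), regions);
            }
            regions.add(e.getKey());
        }
        return byServer;
    }
}
```

With the grouping in hand, one scan thread per map entry gives the one-thread-per-RegionServer layout asked about above.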
HFile Reader in 0.96 and 0.98
Hi, I am trying to make my code compatible with both HBase 0.96 and 0.98, but I have seen that HFile.createReader gained a new "Configuration" parameter. I have been looking for a way to get an HFile.Reader from both versions with the same code, but I don't think it's possible. Is there another way to get an HFile.Reader? I would like my code to work with both versions so I can switch versions quickly.
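One workaround for this kind of signature drift is reflection: try the newer overload first and fall back to the older one, so a single jar compiles against neither version directly. The sketch below demonstrates the pattern on a stand-in Api class; the names are illustrative only (the real target would be HFile.createReader), this is not the actual HBase API:

```java
import java.lang.reflect.Method;

public class VersionAdapter {

    // Stand-in for an API whose static factory gained a parameter between
    // versions (the way HFile.createReader gained a Configuration argument).
    // Both overloads exist here only so the demo is runnable.
    public static class Api {
        public static String createReader(String path) { return "old:" + path; }
        public static String createReader(String path, String conf) { return "new:" + path; }
    }

    // Try the newer two-argument signature first; if the running version
    // does not have it, fall back to the older one.
    public static String openReader(Class<?> api, String path, String conf) throws Exception {
        try {
            Method m = api.getMethod("createReader", String.class, String.class);
            return (String) m.invoke(null, path, conf);
        } catch (NoSuchMethodException e) {
            Method m = api.getMethod("createReader", String.class);
            return (String) m.invoke(null, path);
        }
    }
}
```

In the real case the parameter types would be the HBase classes (FileSystem, Path, CacheConfig, Configuration), looked up by name the same way.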
Coprocessor because of TTL expired?
I want to use coprocessors (observers). Could I have a coprocessor execute my code when a row expires because its TTL has passed? Would it be executed automatically, I mean, without any scan or get over that row? Would it be preDelete/postDelete, or which observer hook?
Re: Store data in HBase with a MapReduce.
If I do have a reduce phase, how many reducers should I have? As many as the number of regions? I have read about HRegionPartitioner, but it has some limitations, and you have to be sure that no region is going to split while you're putting new data into your table. Is that only for performance? What could happen if you put too much data into your table and a region splits while using HRegionPartitioner?

2014-06-26 21:43 GMT+02:00 Stack :
> Be sure to read http://hbase.apache.org/book.html#d3314e5975 Guillermo if
> you have not already. Avoid reduce phase if you can.
>
> St.Ack
>
> On Thu, Jun 26, 2014 at 8:24 AM, Guillermo Ortiz wrote:
>
> > I have a question.
> > I want to execute a MapReduce whose reduce output is going to be stored
> > in HBase.
> >
> > I can do a map-only job and use HFileOutputFormat.configureIncrementalLoad(pJob,
> > table); but I don't know how to do it when I have my own Reduce as well,
> > since configureIncrementalLoad sets up a reducer itself.
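For context, a partitioner like HRegionPartitioner conceptually maps each row key to the index of the region whose start key is the largest one not greater than the key, which is exactly why a mid-job split breaks the mapping: rows land in a reducer for a region boundary that no longer exists. A library-free sketch of that lookup (a hypothetical helper, not HBase's actual class):

```java
public class RegionIndex {

    // Unsigned lexicographic comparison, the order HBase uses for row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Index of the region whose [startKey, nextStartKey) range holds the key.
    // startKeys[0] is the empty start key of the first region, as in a real
    // table's region list.
    public static int partition(byte[] key, byte[][] startKeys) {
        for (int i = startKeys.length - 1; i >= 0; i--) {
            if (compare(key, startKeys[i]) >= 0) return i;
        }
        return 0;
    }
}
```

With one reducer per region and this mapping, each reducer writes keys for exactly one region, which is the property bulk load needs.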
Store data in HBase with a MapReduce.
I have a question. I want to execute a MapReduce whose reduce output is going to be stored in HBase.

So, it's a MapReduce job whose output ends up in HBase. I can do a map-only job and use HFileOutputFormat.configureIncrementalLoad(pJob, table); but I don't know how to do it when I have my own Reduce as well, since configureIncrementalLoad sets up a reducer itself.
Re: Does compression ever improve performance?
I would like to see the scan or get times from that compression and block-encoding benchmark, to figure out how much time you save when your data are smaller but have to be decompressed.

On Saturday, June 14, 2014, Kevin O'dell wrote:
> Hi Jeremy,
>
> I always recommend turning on snappy compression, I have ~20%
> performance increases.
> On Jun 14, 2014 10:25 AM, "Ted Yu" wrote:
>
> > You may have read Doug Meil's writeup where he tried out different
> > ColumnFamily compressions :
> >
> > https://blogs.apache.org/hbase/
> >
> > Cheers
> >
> > On Fri, Jun 13, 2014 at 11:33 AM, jeremy p <
> > athomewithagroove...@gmail.com> wrote:
> >
> > > Thank you -- I'll go ahead and try compression.
> > >
> > > --Jeremy
> > >
> > > On Fri, Jun 13, 2014 at 10:59 AM, Dima Spivak wrote:
> > >
> > > > I'd highly recommend it. In general, compressing your column families
> > > > will improve performance by reducing the resources required to get
> > > > data from disk (even when taking into account the CPU overhead of
> > > > compressing and decompressing).
> > > >
> > > > -Dima
> > > >
> > > > On Fri, Jun 13, 2014 at 10:35 AM, jeremy p <
> > > > athomewithagroove...@gmail.com> wrote:
> > > >
> > > > > Hey all,
> > > > >
> > > > > Right now, I'm not using compression on any of my tables, because
> > > > > our data doesn't take up a huge amount of space. However, I would
> > > > > turn on compression if there was a chance it would improve HBase's
> > > > > performance. By performance, I'm talking about the speed with which
> > > > > HBase responds to requests and retrieves data.
> > > > >
> > > > > Should I turn compression on?
> > > > >
> > > > > --Jeremy
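Lacking published numbers, a rough feel for the size-versus-CPU trade-off can be had with the JDK's own Deflater/Inflater (GZ-like compression; Snappy itself would need an extra library). A minimal probe you can wrap in System.nanoTime() calls to measure compress and decompress cost on your own data:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressionProbe {

    // Compress a byte array with DEFLATE and return the compressed bytes.
    public static byte[] compress(byte[] data) {
        Deflater d = new Deflater();
        d.setInput(data);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    // Decompress back into a buffer of the known original length.
    public static byte[] decompress(byte[] data, int originalLength) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(data);
        byte[] out = new byte[originalLength];
        int n = 0;
        while (n < originalLength) {
            n += inf.inflate(out, n, originalLength - n);
        }
        inf.end();
        return out;
    }
}
```

Running both directions over a sample of real row data, and comparing elapsed time against the size reduction, gives exactly the numbers the benchmark write-up leaves out.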
Delete rowKey in hexadecimal array bytes.
Hi, I'm generating keys with SHA-1. Since the digest is a hex string, after generating the keys I use Hex.decode to save memory, because I can store them in half the space. I have a MapReduce process which deletes some of these keys; the problem is that when I try to delete them, it doesn't work. If I don't do the parse to hex, it works. So, for example, a key like b343664e210e7a7abff3625a005e65e2b0d4616 works, but if I parse this key with Hex.decode to *\xB3CfN!\x0Ezz\xBF\xF3bZ\x00^e\xE2\xB0\ *column=l:dd, timestamp=1384317115000 it doesn't. I have checked the code a lot and I think it's right; plus, if I comment out the hex decoding, it works. Any clue about it? Is there any problem with what I'm trying to do?
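The crux is that after hex decoding, the stored row key is the 20 raw digest bytes, so any later Delete must be built from those exact bytes, not from a re-encoded string; a common pitfall is deleting with the 40-character hex form while the table holds the decoded form. A stdlib-only sketch of the round trip (a small hexToBytes stands in for commons-codec's Hex.decodeHex):

```java
import java.security.MessageDigest;

public class HexKeys {

    // SHA-1 digest rendered as a 40-character lowercase hex string.
    public static String sha1Hex(byte[] input) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(input);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Equivalent of commons-codec Hex.decodeHex: 40 hex chars -> 20 raw bytes.
    // These raw bytes are what must be passed, unchanged, to both the Put
    // and the later Delete for the row.
    public static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }
}
```

If the Put used hexToBytes(key) but the Delete uses key.getBytes(), the two row keys never match, which would produce exactly the silent no-op described above.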
Parallel Scan with TableMapReduceUtil
I am processing data from HBase with a MapReduce job. The input of my MapReduce is a "full" scan of a table. When I execute a full scan with TableMapReduceUtil, is the scan executed in parallel, so that all mappers get their data in parallel, the same way as if I executed many range scans with threads?
Re: Error loading SHA-1 keys with load bulk
The error was that when I was emitting the (key, value) pair, I was applying the SHA-1 to the output key K but not to the row key inside the Value. The Value is a KeyValue, and that's where I had to apply the SHA-1 as well.

2014-05-02 0:42 GMT+02:00 Guillermo Ortiz :
> Yes, I do:
>
> job.setMapperClass(EventMapper.class);
> job.setMapOutputKeyClass(ImmutableBytesWritable.class);
> job.setMapOutputValueClass(KeyValue.class);
>
> FileOutputFormat.setOutputPath(job, hbasePath);
> HTable table = new HTable(jConf, MEM_TABLE_HBASE);
> HFileOutputFormat.configureIncrementalLoad(job, table);
>
> The error is happening in an MRUnit test, I don't know if it changes
> something about the behavior, because I had some troubles in the past for
> the same reason about the serialization in HBase 0.96 and MRUnit. Besides,
> in the setup of the MRUnit test I load some data in hbase with keys in
> sha1 and it works.
>
> On Thursday, May 1, 2014, Jean-Daniel Cryans wrote:
>
>> Are you using HFileOutputFormat.configureIncrementalLoad() to set up the
>> partitioner and the reducers? That will take care of ordering your keys.
>>
>> J-D
>>
>> On Thu, May 1, 2014 at 5:38 AM, Guillermo Ortiz wrote:
>>
>> > I have been looking at the code in HBase, but I don't really understand
>> > why this error happens. Why can't I put those keys into HBase?
>> >
>> > 2014-04-30 17:57 GMT+02:00 Guillermo Ortiz :
>> >
>> > > I'm using HBase with MapReduce to load a lot of data, so I have
>> > > decided to do it with bulk load.
>> > >
>> > > I parse my keys with SHA1, but when I try to load them, I got this
>> > > exception.
>> > > >> > > java.io.IOException: Added a key not lexically larger than previous >> > >> key=\x00(6e9e59f36a7ec2ac54635b2d353e53e677839046\x01l\x00\x00\x01E\xB3>\xC9\xC7\x0E, >> > >> lastkey=\x00(b313a9f1f57c8a07c81dc3221c6151cf3637506a\x01l\x00\x00\x01E\xAE\x18k\x87\x0E >> > > at >> > >> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:207) >> > > at >> > >> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:324) >> > > at >> > >> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:289) >> > > at >> > >> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1206) >> > > at >> > >> org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168) >> > > at >> > >> org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124) >> > > at >> > >> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551) >> > > at >> > >> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85) >> > > >> > > I work with HBase 0.94.6. I have been loking for if I could define any >> > reducer, since, I have defined no one. I have read something about >> > KeyValueSortReducer but, I don'tknow if there's something that extends >> > TableReducer or I'm lookging for a wrong way. >> > > >> > > >> > > >> > >> >
Re: Error loading SHA-1 keys with load bulk
Yes, I do:

job.setMapperClass(EventMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);

FileOutputFormat.setOutputPath(job, hbasePath);
HTable table = new HTable(jConf, MEM_TABLE_HBASE);
HFileOutputFormat.configureIncrementalLoad(job, table);

The error is happening in an MRUnit test; I don't know if that changes the behavior, because I had some troubles in the past for the same reason, about the serialization in HBase 0.96 and MRUnit. Besides, in the setup of the MRUnit test I load some data into HBase with SHA-1 keys and it works.

On Thursday, May 1, 2014, Jean-Daniel Cryans wrote:
> Are you using HFileOutputFormat.configureIncrementalLoad() to set up the
> partitioner and the reducers? That will take care of ordering your keys.
>
> J-D
>
> On Thu, May 1, 2014 at 5:38 AM, Guillermo Ortiz wrote:
>
> > I have been looking at the code in HBase, but I don't really understand
> > why this error happens. Why can't I put those keys into HBase?
> >
> > 2014-04-30 17:57 GMT+02:00 Guillermo Ortiz :
> >
> > > I'm using HBase with MapReduce to load a lot of data, so I have
> > > decided to do it with bulk load.
> > >
> > > I parse my keys with SHA1, but when I try to load them, I got this
> > > exception.
> > > > > > java.io.IOException: Added a key not lexically larger than previous > > > key=\x00(6e9e59f36a7ec2ac54635b2d353e53e677839046\x01l\x00\x00\x01E\xB3>\xC9\xC7\x0E, > > > lastkey=\x00(b313a9f1f57c8a07c81dc3221c6151cf3637506a\x01l\x00\x00\x01E\xAE\x18k\x87\x0E > > > at > > > org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:207) > > > at > > > org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:324) > > > at > > > org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:289) > > > at > > > org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1206) > > > at > > > org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168) > > > at > > > org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124) > > > at > > > org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551) > > > at > > > org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85) > > > > > > I work with HBase 0.94.6. I have been loking for if I could define any > > reducer, since, I have defined no one. I have read something about > > KeyValueSortReducer but, I don'tknow if there's something that extends > > TableReducer or I'm lookging for a wrong way. > > > > > > > > > > > >
Error loading SHA-1 keys with load bulk
I have been looking at the HBase code, but I don't really understand why this error happens. Why can't I put those keys into HBase?

2014-04-30 17:57 GMT+02:00 Guillermo Ortiz :
> I'm using HBase with MapReduce to load a lot of data, so I have decided to
> do it with bulk load.
>
> I parse my keys with SHA1, but when I try to load them, I got this
> exception.
>
> java.io.IOException: Added a key not lexically larger than previous
> key=\x00(6e9e59f36a7ec2ac54635b2d353e53e677839046\x01l\x00\x00\x01E\xB3>\xC9\xC7\x0E,
> lastkey=\x00(b313a9f1f57c8a07c81dc3221c6151cf3637506a\x01l\x00\x00\x01E\xAE\x18k\x87\x0E
> at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:207)
> at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:324)
> at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:289)
> at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1206)
> at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168)
> at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
> at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
> at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
>
> I work with HBase 0.94.6. I have been looking into whether I should define
> a reducer, since I have defined none. I have read something about
> KeyValueSortReducer, but I don't know if there's something that extends
> TableReducer, or if I'm looking in the wrong direction.
Error loading SHA-1 keys with load bulk
I'm using HBase with MapReduce to load a lot of data, so I have decided to do it with bulk load.

I parse my keys with SHA1, but when I try to load them, I get this exception:

java.io.IOException: Added a key not lexically larger than previous key=\x00(6e9e59f36a7ec2ac54635b2d353e53e677839046\x01l\x00\x00\x01E\xB3>\xC9\xC7\x0E, lastkey=\x00(b313a9f1f57c8a07c81dc3221c6151cf3637506a\x01l\x00\x00\x01E\xAE\x18k\x87\x0E
    at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:207)
    at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:324)
    at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:289)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1206)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)

I work with HBase 0.94.6. I have been looking into whether I should define a reducer, since I have defined none. I have read something about KeyValueSortReducer, but I don't know if there's something that extends TableReducer, or if I'm looking in the wrong direction.
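The exception means the HFile writer received a cell whose key compared lower, in unsigned lexicographic order, than the previous one: HFiles require strictly increasing keys, which is why bulk load needs a total-order partitioner and sorted reducer input. A toy model of that guard (not HBase's actual writer) makes the failure easy to reproduce:

```java
public class OrderedWriter {
    private byte[] lastKey;
    private int appended;

    // Unsigned lexicographic compare, the order HBase uses for row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Mimics the check that produced the IOException above: every appended
    // key must be strictly larger than the one before it.
    public void append(byte[] key) {
        if (lastKey != null && compare(key, lastKey) <= 0) {
            throw new IllegalStateException("Added a key not lexically larger than previous");
        }
        lastKey = key;
        appended++;
    }

    public int count() { return appended; }
}
```

The two digests in the stack trace illustrate it directly: "6e9e..." sorts before "b313..." (0x36 < 0x62 on the first byte of the hex text), so appending the 6e9e key after the b313 key trips the check.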
Re: Weird behavior splitting regions
I read the article; that's why I asked the question, because I didn't understand the result I got.

Oh, yes, that's true, how silly of me. I think some of the files are pretty small because the table has two column families and one of them is much smaller than the other one, so it has been split many times. The big regions reach a size close to 1 GB, but the smaller regions end up pretty small because they have been split so many times.

What I don't know is why HBase decides to split the table so late: not when I create the pre-split table, but two hours later or whenever. Anyway, that was my mistake; I'm just curious about it.

2014-04-15 12:17 GMT+02:00 divye sheth :
> The default split policy in hbase0.94.x is IncreaseToUpperBound rather than
> ConstantSizeSplitPolicy which was the default in the older versions of
> hbase.
>
> Please refer to the link given below to understand how a
> IncreaseToUpperBoundSplitPolicy works:
> http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
> check the auto-splitting section
>
> Hope this answers your question
>
> Thanks
> Divye Sheth
>
> On Tue, Apr 15, 2014 at 3:36 PM, Bharath Vissapragada <
> bhara...@cloudera.com> wrote:
>
> > >There're some new regions that they're just a some KBytes!. Why they are
> > so small?? When does HBase decide to split? because it started to split
> > two hours later to create the table.
> >
> > When hbase does a split, it doesn't actually split at the disk/file
> > level. Its just a metadata operation which creates new regions that
> > contain the reference files that still point to old HFiles. That is the
> > reason you find KB size regions.
> >
> > >I thought major compaction just happen once at day and compact many
> > files per region. Data is always the same here, I don't inject new data.
> >
> > IIRC sometimes minor compactions get promoted to major compactions based
> > on some criteria, but I'll leave it for others to answer!
> > > > > > > > On Tue, Apr 15, 2014 at 3:15 PM, Guillermo Ortiz > >wrote: > > > > > I have a table in Hbase that sizes around 96Gb, > > > > > > I generate 4 regions of 30Gb. Some time, table starts to split because > > the > > > max size for region is 1Gb (I just realize of that, I'm going to change > > it > > > or create more pre-splits.). > > > > > > There're two things that I don't understand. how is it creating the > > splits? > > > right now I have 130 regions and growing. The problem is the size of > the > > > new regions: > > > > > > 1.7 M/hbase/filters/4ddbc34a2242e44c03121ae4608788a2 > > > 1.6 G/hbase/filters/548bdcec79cfe9a99fa57cb18f801be2 > > > 3.1 G/hbase/filters/58b50df089bd9d4d1f079f53238e060d > > > 2.5 M/hbase/filters/5a0d6d5b3b8faf67889ac5f5c2947c4f > > > 1.9 G/hbase/filters/5b0a35b5735a473b7e804c4b045ce374 > > > 883.4 M /hbase/filters/5b49c68e305b90d87b3c64a0eee60b8c > > > 1.7 M/hbase/filters/5d43fd7ea9808ab7d2f2134e80fbfae7 > > > 632.4 M /hbase/filters/5f04c7cd450d144f88fb4c7cff0796a2 > > > > > > There're some new regions that they're just a some KBytes!. Why they > are > > so > > > small?? When does HBase decide to split? because it started to split > two > > > hours later to create the table. > > > > > > One, I create the table and insert data, I don't insert new data or > > modify > > > them. 
> > > > > > > > > Another interested point it's why there're major compactions: > > > 2014-04-15 11:33:47,400 INFO > org.apache.hadoop.hbase.regionserver.Store: > > > Renaming compacted file at > > > > > > > > > hdfs://m01.cluster:8020/hbase/filters/ef994715505054299ede8c48c600cea4/.tmp/df90c260cb4e4256a153dd178244f04c > > > to > > > > > > > > > hdfs://m01.cluster:8020/hbase/filters/ef994715505054299ede8c48c600cea4/d/df90c260cb4e4256a153dd178244f04c > > > 2014-04-15 11:33:47,407 INFO > > > org.apache.hadoop.hbase.regionserver.StoreFile$Reader: Loaded ROWCOL > > > (CompoundBloomFilter) metadata for df90c260cb4e4256a153dd178244f04c > > > 2014-04-15 11:33:47,416 INFO > org.apache.hadoop.hbase.regionserver.Store:* > > > Completed major compaction of 1 file*(s) in d of > > > filters,51,1397554175140.ef994715505054299ede8c48c600cea4. into > > > df90c260cb4e4256a153dd178244f04c, size=789.
Weird behavior splitting regions
I have a table in HBase that is around 96 GB in size.

I generated 4 regions of 30 GB. After some time, the table started to split because the max size per region is 1 GB (I just realized that; I'm going to change it or create more pre-splits).

There are two things that I don't understand. First, how is it creating the splits? Right now I have 130 regions and growing. The problem is the size of the new regions:

1.7 M    /hbase/filters/4ddbc34a2242e44c03121ae4608788a2
1.6 G    /hbase/filters/548bdcec79cfe9a99fa57cb18f801be2
3.1 G    /hbase/filters/58b50df089bd9d4d1f079f53238e060d
2.5 M    /hbase/filters/5a0d6d5b3b8faf67889ac5f5c2947c4f
1.9 G    /hbase/filters/5b0a35b5735a473b7e804c4b045ce374
883.4 M  /hbase/filters/5b49c68e305b90d87b3c64a0eee60b8c
1.7 M    /hbase/filters/5d43fd7ea9808ab7d2f2134e80fbfae7
632.4 M  /hbase/filters/5f04c7cd450d144f88fb4c7cff0796a2

Some of the new regions are just a few KBytes! Why are they so small? And when does HBase decide to split? Because it started to split two hours after the table was created.

To be clear, I create the table and insert the data once; I don't insert new data or modify it afterwards.

Another interesting point is why there are major compactions:

2014-04-15 11:33:47,400 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at hdfs://m01.cluster:8020/hbase/filters/ef994715505054299ede8c48c600cea4/.tmp/df90c260cb4e4256a153dd178244f04c to hdfs://m01.cluster:8020/hbase/filters/ef994715505054299ede8c48c600cea4/d/df90c260cb4e4256a153dd178244f04c
2014-04-15 11:33:47,407 INFO org.apache.hadoop.hbase.regionserver.StoreFile$Reader: Loaded ROWCOL (CompoundBloomFilter) metadata for df90c260cb4e4256a153dd178244f04c
2014-04-15 11:33:47,416 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 1 file(s) in d of filters,51,1397554175140.ef994715505054299ede8c48c600cea4.
into df90c260cb4e4256a153dd178244f04c, size=789.1 M; total size for store is 789.1 M
2014-04-15 11:33:47,416 INFO org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest: completed compaction: regionName=filters,51,1397554175140.ef994715505054299ede8c48c600cea4., storeName=d, fileCount=1, fileSize=1.5 G, priority=6, time=414761474510060; duration=7sec

I thought major compactions happened just once a day and compacted many files per region. The data here is always the same; I don't inject new data.

I'm working with 0.94.6 CDH4.4. I'm going to change the size of the regions, but I would like to understand why these things happen. Thank you.
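As a hedged illustration of why splits start well after table creation: the IncreasingToUpperBound-style policy mentioned in the reply uses a split threshold that grows with the number of regions of the table on the server, capped by the configured max region file size. The exact formula differs across HBase versions; the sketch below uses one commonly cited form, 2 * flushSize * regionCount^2, purely to show the shape of the curve, and is not guaranteed to match 0.94's code:

```java
public class SplitThreshold {

    // Hedged sketch of an IncreasingToUpperBound-style split threshold:
    // small while the table has few regions on a server, then growing
    // quickly until capped at the configured max region file size.
    // The multiplier and exponent vary between HBase versions.
    public static long sizeToCheck(int tableRegionsCount, long flushSize, long maxFileSize) {
        if (tableRegionsCount == 0) return maxFileSize;
        long grown = 2L * flushSize * tableRegionsCount * tableRegionsCount;
        return Math.min(maxFileSize, grown);
    }
}
```

With a 128 MB flush size, the threshold is only 256 MB while a server hosts one region of the table, which explains why a freshly pre-split table sits quietly until enough data has flushed, and then splits in a burst once several regions cross the low early thresholds.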
Re: How to generate a large dataset quickly.
But if I'm using bulk load, I think that method bypasses the WAL, right? I have no idea about autoFlush: is it still necessary to set it to false, or does the bulk load do some kind of magic with that as well?

I could try the loads without bulk load, but I don't think that's the problem; maybe it's just the time the cluster needs, although it seems like too much time.

2014-04-14 22:51 GMT+02:00 lars hofhansl :
> +1 to what Vladimir said.
> For the Puts in question you can also disable the write ahead log (WAL)
> and issue a flush on the table after your ingest.
>
> -- Lars
>
> - Original Message -
> From: Vladimir Rodionov
> To: "user@hbase.apache.org"
> Cc:
> Sent: Monday, April 14, 2014 11:15 AM
> Subject: RE: How to generate a large dataset quickly.
>
> There is no need to run M/R unless your cluster is large (very large)
> Single multithreaded client can easily ingest 10s of thousands rows per
> sec. Check YCSB benchmark tool, for example.
>
> Make sure you disable both region splitting and major compaction during
> data ingestion and pre-split regions accordingly to improve overall
> performance.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodio...@carrieriq.com
>
> From: Ted Yu [yuzhih...@gmail.com]
> Sent: Monday, April 14, 2014 9:16 AM
> To: user@hbase.apache.org
> Subject: Re: How to generate a large dataset quickly.
>
> I looked at revision history for HFileOutputFormat.java
> There was one patch, HBASE-8949, which went into 0.94.11 but it shouldn't
> affect throughput much.
>
> If you can use ganglia (or some similar tool) to pinpoint what caused the
> low ingest rate, that would give us more clue.
>
> BTW Is upgrading to newer release, such as 0.98.1 (which contains
> HBASE-8755), an option for you ?
>
> Cheers
>
> On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz wrote:
>
> > I'm using.
0.94.6-cdh4.4.0,
> >
> > I use the bulkload:
> > FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> > FileOutputFormat.setOutputPath(job, hbasePath);
> > HTable table = new HTable(jConf, HBASE_TABLE);
> > HFileOutputFormat.configureIncrementalLoad(job, table);
> >
> > It seems that it takes really long time when it starts to execute the
> > Puts to HBase in the reduce phase.
> >
> > 2014-04-14 14:35 GMT+02:00 Ted Yu :
> >
> > > Which hbase release did you run mapreduce job ?
> > >
> > > Cheers
> > >
> > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz wrote:
> > >
> > > > I want to create a large dateset for HBase with different versions
> > > > and number of rows. It's about 10M rows and 100 versions to do some
> > > > benchmarks.
> > > >
> > > > What's the fastest way to create it?? I'm generating the dataset
> > > > with a Mapreduce of 100.000rows and 10verions. It takes 17minutes
> > > > and size around 7Gb. I don't know if I could do it quickly. The
> > > > bottleneck is when MapReduces write the output and when transfer
> > > > the output to the Reduces.
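Following Vladimir's advice to pre-split before a big ingest: with SHA-1-hex row keys the keyspace is uniform over the hex alphabet, so split points can be computed by slicing the two-character hex prefix evenly. A small sketch (a hypothetical helper; the resulting strings would go through Bytes.toBytes before being passed to HBaseAdmin.createTable(descriptor, splits)):

```java
public class PreSplits {

    // Evenly spaced split keys over the first two hex characters of a
    // SHA-1-hex row key space: byte prefixes 0x00..0xff mapped to "00".."ff".
    // Returns numRegions - 1 split points, the shape createTable expects.
    public static String[] splitKeys(int numRegions) {
        String[] keys = new String[numRegions - 1];
        for (int i = 1; i < numRegions; i++) {
            keys[i - 1] = String.format("%02x", (int) (256L * i / numRegions));
        }
        return keys;
    }
}
```

Because the digest is uniformly distributed, these prefixes give regions of roughly equal load, so ingestion spreads across all region servers from the first Put instead of hammering one region until the first split.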
Re: How to generate a large dataset quickly.
Are there any benchmarks about how long it should take to insert data into HBase, to have a reference? The output of my Mapper is 3.2 million records, so I execute 3.2 million Puts in HBase. The data has to be copied and sent to the reducers, but with a 1Gb network it shouldn't take too much time. I'll check Ganglia. 2014-04-14 18:16 GMT+02:00 Ted Yu : > I looked at revision history for HFileOutputFormat.java > There was one patch, HBASE-8949, which went into 0.94.11 but it shouldn't > affect throughput much. > > If you can use ganglia (or some similar tool) to pinpoint what caused the > low ingest rate, that would give us more clue. > > BTW Is upgrading to newer release, such as 0.98.1 (which contains > HBASE-8755), an option for you ? > > Cheers > > > On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz >wrote: > > > I'm using. 0.94.6-cdh4.4.0, > > > > I use the bulkload: > > FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER)); > > FileOutputFormat.setOutputPath(job, hbasePath); > > HTable table = new HTable(jConf, HBASE_TABLE); > > HFileOutputFormat.configureIncrementalLoad(job, table); > > > > It seems that it takes really long time when it starts to execute the > Puts > > to HBase in the reduce phase. > > > > > > > > 2014-04-14 14:35 GMT+02:00 Ted Yu : > > > Which hbase release did you run mapreduce job ? > > > > > > Cheers > > > > > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz > > wrote: > > > > > > > I want to create a large dateset for HBase with different versions > and > > > > number of rows. It's about 10M rows and 100 versions to do some > > > benchmarks. > > > > > > > > What's the fastest way to create it?? I'm generating the dataset > with a > > > > Mapreduce of 100.000rows and 10verions. It takes 17minutes and size > > > around > > > > 7Gb. I don't know if I could do it quickly. The bottleneck is when > > > > MapReduces write the output and when transfer the output to the > > Reduces. > > > > > >
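As a rough sanity check against Vladimir's figure earlier in the thread (a single multithreaded client ingesting "10s of thousands rows per sec"), the 3.2 million Puts mentioned here can be estimated with back-of-the-envelope arithmetic. The 10,000 rows/sec rate below is just the low end of that quoted range, not a measured number.

```java
// Rough estimate: time to ingest 3.2M Puts at an assumed sustained
// rate of 10,000 rows/sec (low end of the figure quoted in the
// thread). Purely illustrative arithmetic, not a benchmark.
public class IngestEstimate {
    static long secondsToIngest(long rows, long rowsPerSec) {
        return rows / rowsPerSec;
    }

    public static void main(String[] args) {
        long secs = secondsToIngest(3_200_000L, 10_000L);
        System.out.println(secs + " s, i.e. about " + (secs / 60) + " min");
    }
}
```

At that rate the Puts alone would take around five minutes, so an overall job time of 17 minutes suggests the reduce-phase writes, not the shuffle over the 1Gb network, dominate.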
Re: How to generate a large dataset quickly.
I'm using 0.94.6-cdh4.4.0. I use the bulkload: FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER)); FileOutputFormat.setOutputPath(job, hbasePath); HTable table = new HTable(jConf, HBASE_TABLE); HFileOutputFormat.configureIncrementalLoad(job, table); It seems that it takes a really long time when it starts to execute the Puts to HBase in the reduce phase. 2014-04-14 14:35 GMT+02:00 Ted Yu : > Which hbase release did you run mapreduce job ? > > Cheers > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz wrote: > > > I want to create a large dateset for HBase with different versions and > > number of rows. It's about 10M rows and 100 versions to do some > benchmarks. > > > > What's the fastest way to create it?? I'm generating the dataset with a > > Mapreduce of 100.000rows and 10verions. It takes 17minutes and size > around > > 7Gb. I don't know if I could do it quickly. The bottleneck is when > > MapReduces write the output and when transfer the output to the Reduces. >
How to generate a large dataset quickly.
I want to create a large dataset for HBase with different numbers of rows and versions: about 10M rows and 100 versions, to do some benchmarks. What's the fastest way to create it? I'm generating the dataset with a MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes and the size is around 7Gb. I don't know if I could do it more quickly. The bottleneck is when the mappers write their output and when that output is transferred to the reducers.
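The MapReduce generator itself isn't shown in the thread; a minimal stand-alone sketch of the same idea (N rows, each with V versioned cells) looks like this. The row-key format, qualifier name, and timestamps are made up for illustration; in the real job each triple would become a KeyValue/Put emitted by the mapper.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of generating a versioned benchmark dataset in memory: each
// of the `rows` row keys gets `versions` cells with distinct
// timestamps. Names and sizes here are illustrative only.
public class DatasetGenerator {
    // Returns (rowKey, column, timestamp) triples.
    static List<String[]> generate(int rows, int versions) {
        List<String[]> cells = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            String rowKey = String.format("row-%08d", r);
            for (int v = 0; v < versions; v++) {
                long ts = 1_000_000L + v; // distinct timestamp per version
                cells.add(new String[] { rowKey, "cf:q1", Long.toString(ts) });
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        List<String[]> cells = generate(100, 10);
        System.out.println(cells.size()); // 100 rows x 10 versions = 1000 cells
    }
}
```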
Re: Lease exception when I execute large scan with filters.
> >>>>>> Another little question is, when the filter I'm using, Do I check > all the > >>>>>> versions? or just the newest? Because, I'm wondering if when I do a > scan > >>>>>> over all the table, I look for the value "5" in all the dataset or > I'm > >>>>>> just > >>>>>> looking for in one newest version of each value. > >>>>>> > >>>>>> > >>>>>> On 10/04/14 16:52, gortiz wrote: > >>>>>> > >>>>>> I was trying to check the behaviour of HBase. The cluster is a > group of > >>>>>>> old computers, one master, five slaves, each one with 2Gb, so, > 12gb in > >>>>>>> total. > >>>>>>> The table has a column family with 1000 columns and each column > with > >>>>>>> 100 > >>>>>>> versions. > >>>>>>> There's another column faimily with four columns an one image of > 100kb. > >>>>>>> (I've tried without this column family as well.) > >>>>>>> The table is partitioned manually in all the slaves, so data are > >>>>>>> balanced > >>>>>>> in the cluster. > >>>>>>> > >>>>>>> I'm executing this sentence *scan 'table1', {FILTER => > "ValueFilter(=, > >>>>>>> 'binary:5')"* in HBase 0.94.6 > >>>>>>> My time for lease and rpc is three minutes. > >>>>>>> Since, it's a full scan of the table, I have been playing with the > >>>>>>> BLOCKCACHE as well (just disable and enable, not about the size of > >>>>>>> it). I > >>>>>>> thought that it was going to have too much calls to the GC. I'm not > >>>>>>> sure > >>>>>>> about this point. > >>>>>>> > >>>>>>> I know that it's not the best way to use HBase, it's just a test. I > >>>>>>> think > >>>>>>> that it's not working because the hardware isn't enough, although, > I > >>>>>>> would > >>>>>>> like to try some kind of tunning to improve it. 
> >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On 10/04/14 14:21, Ted Yu wrote: > >>>>>>> > >>>>>>> Can you give us a bit more information: > >>>>>>>> HBase release you're running > >>>>>>>> What filters are used for the scan > >>>>>>>> > >>>>>>>> Thanks > >>>>>>>> > >>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz wrote: > >>>>>>>> > >>>>>>>> I got this error when I execute a full scan with filters about a > >>>>>>>> table. > >>>>>>>> > >>>>>>>>> Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase. > >>>>>>>>> regionserver.LeaseException: > >>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException: lease > >>>>>>>>> '-4165751462641113359' does not exist > >>>>>>>>> at > org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231) > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegionServer. > >>>>>>>>> next(HRegionServer.java:2482) > >>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > >>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke( > >>>>>>>>> NativeMethodAccessorImpl.java:39) > >>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke( > >>>>>>>>> DelegatingMethodAccessorImpl.java:25) > >>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597) > >>>>>>>>> at > org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call( > >>>>>>>>> WritableRpcEngine.java:320) > >>>>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run( > >>>>>>>>> HBaseServer.java:1428) > >>>>>>>>> > >>>>>>>>> I have read about increase the lease time and rpc time, but it's > not > >>>>>>>>> working.. what else could I try?? The table isn't too big. I have > >>>>>>>>> been > >>>>>>>>> checking the logs from GC, HMaster and some RegionServers and I > >>>>>>>>> didn't see > >>>>>>>>> anything weird. I tried as well to try with a couple of caching > >>>>>>>>> values. 
> >>>>>>>>> > >>>>>>>>> > >>>>>>> -- > >>>>>> *Guillermo Ortiz* > >>>>>> /Big Data Developer/ > >>>>>> > >>>>>> Telf.: +34 917 680 490 > >>>>>> Fax: +34 913 833 301 > >>>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain > >>>>>> > >>>>>> _http://www.bidoop.es_ > >
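The "three minutes for lease and rpc" mentioned above corresponds, in the 0.94 line, to hbase-site.xml settings like the following. The values shown are just the three-minute figure from the thread; raising them only masks a slow scan (as the thread concludes, the real cost is reading every version of every cell), and property names should be checked against your release's hbase-default.xml.

```xml
<!-- Sketch for hbase-site.xml (0.94-era property names) -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>180000</value> <!-- scanner lease timeout, ms (default 60000) -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>180000</value> <!-- client RPC timeout, ms -->
</property>
```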
Re: Lease exception when I execute large scan with filters.
Okay, thank you, I'll check it this Monday. I didn't know that Scan checks all the versions. So I was checking each column and each version, although it only showed me the newest version because I didn't indicate anything about the VERSIONS attribute. It makes sense that it takes so long. 2014-04-11 16:57 GMT+02:00 Ted Yu : > In your previous example: > scan 'table1', {FILTER => "ValueFilter(=, 'binary:5')"} > > there was no expression w.r.t. timestamp. See the following javadoc from > Scan.java: > > * To only retrieve columns within a specific range of version timestamps, > > * execute {@link #setTimeRange(long, long) setTimeRange}. > > * > > * To only retrieve columns with a specific timestamp, execute > > * {@link #setTimeStamp(long) setTimestamp}. > > You can use one of the above methods to make your scan more selective. > > > ValueFilter#filterKeyValue(Cell) doesn't utilize advanced feature of > ReturnCode. You can refer to: > > > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.ReturnCode.html > > You can take a look at SingleColumnValueFilter#filterKeyValue() for example > of how various ReturnCode's are used to speed up scan. > > Cheers > > > On Fri, Apr 11, 2014 at 8:40 AM, Guillermo Ortiz >wrote: > > > I read something interesting about it in HBase TDG. > > > > Page 344: > > The StoreScanner class combines the store files and memstore that the > > Store instance > > contains. It is also where the exclusion happens, based on the Bloom > > filter, or the timestamp. If you are asking for versions that are not > more > > than 30 minutes old, for example, you can skip all storage files that are > > older than one hour: they will not contain anything of interest. See "Key > > Design" on page 357 for details on the exclusion, and how to make use of > > it. > > > > So, I guess that it doesn't have to read all the HFiles?? But, I don't > know > > if HBase really uses the timestamp of each row or the date of the file. 
I > > guess when I execute the scan, it reads everything, but, I don't know > why. > > I think there's something else that I don't see so that everything works > to > > me. > > > > > > 2014-04-11 13:05 GMT+02:00 gortiz : > > > > > Sorry, I didn't get it why it should read all the timestamps and not > just > > > the newest it they're sorted and you didn't specific any timestamp in > > your > > > filter. > > > > > > > > > > > > On 11/04/14 12:13, Anoop John wrote: > > > > > >> In the storage layer (HFiles in HDFS) all versions of a particular > cell > > >> will be staying together. (Yes it has to be lexicographically ordered > > >> KVs). So during a scan we will have to read all the version data. At > > this > > >> storage layer it doesn't know the versions stuff etc. > > >> > > >> -Anoop- > > >> > > >> On Fri, Apr 11, 2014 at 3:33 PM, gortiz wrote: > > >> > > >> Yes, I have tried with two different values for that value of > versions, > > >>> 1000 and maximum value for integers. > > >>> > > >>> But, I want to keep those versions. I don't want to keep just 3 > > versions. > > >>> Imagine that I want to record a new version each minute and store a > > day, > > >>> those are 1440 versions. > > >>> > > >>> Why is HBase going to read all the versions?? , I thought, if you > don't > > >>> indicate any versions it's just read the newest and skip the rest. It > > >>> doesn't make too much sense to read all of them if data is sorted, > plus > > >>> the > > >>> newest version is stored in the top. > > >>> > > >>> > > >>> > > >>> On 11/04/14 11:54, Anoop John wrote: > > >>> > > >>>What is the max version setting u have done for ur table cf? When u > > >>>> set > > >>>> some a value, HBase has to keep all those versions. During a scan it > > >>>> will > > >>>> read all those versions. In 94 version the default value for the max > > >>>> versions is 3. I guess you have set some bigger value. If u have > > not, > > >>>> mind testing after a major compaction?
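Anoop's point above (all versions of a cell stay together in the store files, newest first) is a consequence of HBase's cell sort order; a minimal pure-Java illustration with a simplified cell model follows. This is not the real KeyValue class, just a sketch of the ordering it implies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration: within a (row, qualifier), HBase cells sort by
// timestamp DESCENDING, so all versions are adjacent on disk and a
// scanner walks through every one of them even when the caller only
// wants the newest. Simplified model of the real KeyValue ordering.
public class CellOrder {
    record Cell(String row, String qualifier, long ts) {}

    static void sortLikeHBase(List<Cell> cells) {
        cells.sort(Comparator.comparing(Cell::row)
                .thenComparing(Cell::qualifier)
                .thenComparing(Comparator.comparingLong(Cell::ts).reversed()));
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>(List.of(
                new Cell("r1", "q1", 100),
                new Cell("r1", "q1", 300),
                new Cell("r1", "q1", 200)));
        sortLikeHBase(cells);
        // Newest version of r1/q1 comes first: 300, then 200, then 100.
        System.out.println(cells.get(0).ts());
    }
}
```

This is why a full scan over a table with 100 versions per cell reads 100x the data of a single-version table, regardless of the VERSIONS returned to the client.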
Re: Lease exception when I execute large scan with filters.
gt;>>>>> I'm generating again the dataset with a bigger blocksize (previously >>>>>> was >>>>>> 64Kb, now, it's going to be 1Mb). I could try tunning the scanning and >>>>>> baching parameters, but I don't think they're going to affect too >>>>>> much. >>>>>> >>>>>> Another test I want to do, it's generate the same dataset with just >>>>>> 100versions, It should spend around the same time, right? Or am I >>>>>> wrong? >>>>>> >>>>>> On 10/04/14 18:08, Ted Yu wrote: >>>>>> >>>>>> It should be newest version of each value. >>>>>> >>>>>>> Cheers >>>>>>> >>>>>>> >>>>>>> On Thu, Apr 10, 2014 at 9:55 AM, gortiz wrote: >>>>>>> >>>>>>> Another little question is, when the filter I'm using, Do I check all >>>>>>> the >>>>>>> >>>>>>>versions? or just the newest? Because, I'm wondering if when I do >>>>>>>> a >>>>>>>> scan >>>>>>>> over all the table, I look for the value "5" in all the dataset or >>>>>>>> I'm >>>>>>>> just >>>>>>>> looking for in one newest version of each value. >>>>>>>> >>>>>>>> >>>>>>>> On 10/04/14 16:52, gortiz wrote: >>>>>>>> >>>>>>>> I was trying to check the behaviour of HBase. The cluster is a group >>>>>>>> of >>>>>>>> >>>>>>>> old computers, one master, five slaves, each one with 2Gb, so, 12gb >>>>>>>>> in >>>>>>>>> total. >>>>>>>>> The table has a column family with 1000 columns and each column >>>>>>>>> with >>>>>>>>> 100 >>>>>>>>> versions. >>>>>>>>> There's another column faimily with four columns an one image of >>>>>>>>> 100kb. >>>>>>>>> (I've tried without this column family as well.) >>>>>>>>> The table is partitioned manually in all the slaves, so data are >>>>>>>>> balanced >>>>>>>>> in the cluster. >>>>>>>>> >>>>>>>>> I'm executing this sentence *scan 'table1', {FILTER => >>>>>>>>> "ValueFilter(=, >>>>>>>>> 'binary:5')"* in HBase 0.94.6 >>>>>>>>> My time for lease and rpc is three minutes. 
>>>>>>>>> Since, it's a full scan of the table, I have been playing with the >>>>>>>>> BLOCKCACHE as well (just disable and enable, not about the size of >>>>>>>>> it). I >>>>>>>>> thought that it was going to have too much calls to the GC. I'm not >>>>>>>>> sure >>>>>>>>> about this point. >>>>>>>>> >>>>>>>>> I know that it's not the best way to use HBase, it's just a test. I >>>>>>>>> think >>>>>>>>> that it's not working because the hardware isn't enough, although, >>>>>>>>> I >>>>>>>>> would >>>>>>>>> like to try some kind of tunning to improve it. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 10/04/14 14:21, Ted Yu wrote: >>>>>>>>> >>>>>>>>> Can you give us a bit more information: >>>>>>>>> >>>>>>>>> HBase release you're running >>>>>>>>>> What filters are used for the scan >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> >>>>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz wrote: >>>>>>>>>> >>>>>>>>>> I got this error when I execute a full scan with filters >>>>>>>>>> about a >>>>>>>>>> table. >>>>>>>>>> >>>>>>>>>> Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase. >>>>>>>>>> >>>>>>>>>>> regionserver.LeaseException: >>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException: lease >>>>>>>>>>> '-4165751462641113359' does not exist >>>>>>>>>>>at org.apache.hadoop.hbase.regionserver.Leases. >>>>>>>>>>> removeLease(Leases.java:231) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>at org.apache.hadoop.hbase.regionserver.HRegionServer. >>>>>>>>>>> next(HRegionServer.java:2482) >>>>>>>>>>>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native >>>>>>>>>>> Method) >>>>>>>>>>>at sun.reflect.NativeMethodAccessorImpl.invoke( >>>>>>>>>>> NativeMethodAccessorImpl.java:39) >>>>>>>>>>>at sun.reflect.DelegatingMethodAccessorImpl.invoke( >>>>>>>>>>> DelegatingMethodAccessorImpl.java:25) >>>>>>>>>>>at java.lang.reflect.Method.invoke(Method.java:597) >>>>>>>>>>>at org.apache.hadoop.hbase.ipc. 
>>>>>>>>>>> WritableRpcEngine$Server.call( >>>>>>>>>>> WritableRpcEngine.java:320) >>>>>>>>>>>at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run( >>>>>>>>>>> HBaseServer.java:1428) >>>>>>>>>>> >>>>>>>>>>> I have read about increase the lease time and rpc time, but it's >>>>>>>>>>> not >>>>>>>>>>> working.. what else could I try?? The table isn't too big. I have >>>>>>>>>>> been >>>>>>>>>>> checking the logs from GC, HMaster and some RegionServers and I >>>>>>>>>>> didn't see >>>>>>>>>>> anything weird. I tried as well to try with a couple of caching >>>>>>>>>>> values. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> >>>>>>>>>> *Guillermo Ortiz* >>>>>>>> /Big Data Developer/ >>>>>>>> >>>>>>>> Telf.: +34 917 680 490 >>>>>>>> Fax: +34 913 833 301 >>>>>>>> >>>>>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain >>>>>>>> >>>>>>>> _http://www.bidoop.es_ >>>>>>>> > > -- > *Guillermo Ortiz* > /Big Data Developer/ > > Telf.: +34 917 680 490 > Fax: +34 913 833 301 > C/ Manuel Tovar, 49-53 - 28034 
Madrid - Spain > > _http://www.bidoop.es_ > >