Parse Scan to Base64

2019-07-22 Thread Guillermo Ortiz Fernández
I want to convert a Scan object to Base64. I know there is a method that does
it, but that method was originally package-private, and since I can't upgrade
this library I can't use it.

I have checked the code and it seems to do more than just encode to Base64;
it does some work before the encoding. Is that true? Is it not as simple as
using java.util.Base64 or another class to do it myself?
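
For reference, that method essentially serializes the Scan to its protobuf form
first and only then Base64-encodes the bytes, so it is indeed more than a plain
Base64 of the object. A minimal sketch of doing the same thing yourself,
assuming an HBase 1.x-style client API (in 2.x ProtobufUtil lives in a shaded
package, so verify the class names against the version you are stuck on):

import java.io.IOException;
import java.util.Base64;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.protobuf.generated.ClientProtos;

public final class ScanSerializer {
    // Serialize the Scan to protobuf, then Base64-encode the resulting bytes.
    public static String scanToBase64(Scan scan) throws IOException {
        ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
        return Base64.getEncoder().encodeToString(proto.toByteArray());
    }
}

If the string has to be consumed by the library itself (for example as a
TableInputFormat scan property), the decoder on the other side must of course
accept the same Base64 variant.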


Trying to execute a put to docker cloudera hbase

2019-06-27 Thread Guillermo Ortiz Fernández
I'm trying to insert a record into HBase. I have a Docker Cloudera QuickStart
image.
I can connect with telnet from my machine to the ZooKeeper and HBase ports
exposed by Docker (2181, 6, 60010).

I have included the hbase-site, core-site and hdfs-site files in the
classpath.

When the put is executed the program stays blocked. Why?







TableName nameTable = TableName.valueOf(NAMETABLE);
Connection conn = ConnectionFactory.createConnection(config);
hTable = conn.getTable(nameTable);
Put p = new Put(Bytes.toBytes(informations.getString("id")));
p.add(Bytes.toBytes("informations"), Bytes.toBytes("carId"), Bytes.toBytes(carId));
hTable.put(p);   // --> blocked here
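
When a put blocks like this, the client is usually still inside its retry loop;
it can often reach ZooKeeper but not the RegionServer address that ZooKeeper
hands back (a common situation with Docker-published ports). A minimal sketch,
assuming standard HBase client property keys; the host and timeout values are
only placeholders, not a recommended configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class FailFastClient {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum", "localhost");        // host where Docker publishes the ports
        config.set("hbase.zookeeper.property.clientPort", "2181");
        config.setInt("hbase.client.retries.number", 3);          // default retry budget is much larger
        config.setInt("hbase.client.operation.timeout", 30000);   // ms; bounds blocking calls such as put()
        try (Connection conn = ConnectionFactory.createConnection(config)) {
            System.out.println("connected: " + !conn.isClosed());
        }
    }
}

With a bounded retry budget the put should fail with the underlying connection
error instead of appearing to hang forever, which makes it easier to see which
address the client is actually trying to reach.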


Scan vs TableInputFormat to process data

2019-05-28 Thread Guillermo Ortiz Fernández
Just to be sure: if I execute a Scan inside Spark, the execution goes
through the RegionServers and I get all the features of HBase/Scan (filters and
so on), and all the parallelization is handled by the RegionServers (even
though I'm running the program with Spark).
If I use TableInputFormat, I read all the column families (even if I don't
want to) with no prior filtering either; it just opens the files of an HBase
table and processes them completely. All the parallelization is in Spark and
HBase isn't used at all; it just reads from HDFS the files that HBase stored
for a specific table.

Am I missing something?
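
As a side note, TableInputFormat is normally handed a serialized Scan through
its SCAN property, so a column-family restriction or a filter can be passed
down that way too. A minimal sketch of wiring it into Spark, where the table
name, column family and master URL are placeholders (convertScanToString is the
same Base64 serialization discussed in the first thread above and is public in
recent HBase versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseSparkScan {
    public static void main(String[] args) throws Exception {
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("cf"));   // restrict the read to one column family
        scan.setCaching(500);

        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "mytable");
        conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

        JavaSparkContext sc = new JavaSparkContext("local[*]", "hbase-scan");
        JavaPairRDD<ImmutableBytesWritable, Result> rdd =
                sc.newAPIHadoopRDD(conf, TableInputFormat.class,
                        ImmutableBytesWritable.class, Result.class);
        System.out.println("rows: " + rdd.count());
        sc.stop();
    }
}

Each Spark partition corresponds to one table region, and the records are still
served by the RegionServers through the configured Scan; whether that beats a
hand-rolled parallel Scan for a full-table job mostly comes down to where you
want the parallelism managed.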


Re: Reading the whole table with MapReduce and Spark.

2019-05-28 Thread Guillermo Ortiz Fernández
Another small doubt: if I use the TableInputFormat class to read an HBase
table, am I going to read the whole table, or will data that hasn't been
flushed to store files yet be missed?

El mié., 29 may. 2019 a las 0:14, Guillermo Ortiz Fernández (<
guillermo.ortiz.f...@gmail.com>) escribió:

> it depends of the row, they did only share 5% of the qualifiers names.
> Each row could have about 500-3000 columns in 3 column families. One of
> them has 80% of the columns.
>
> The table has around 75M of rows.
>
> El mar., 28 may. 2019 a las 17:33,  escribió:
>
>> Guillermo
>>
>>
>> How large is your table?   How many columns?
>>
>>
>> Sincerely,
>>
>> Sean
>>
>> > On May 28, 2019 at 10:11 AM Guillermo Ortiz > mailto:konstt2...@gmail.com > wrote:
>> >
>> >
>> > I have a doubt. When you process a Hbase table with MapReduce you
>> could use
>> > the TableInputFormat, I understand that it goes directly to HDFS
>> files
>> > (storesFiles in HDFS) , so you could do some filter in the map
>> phase and
>> > it's not the same to go through to the region servers to do some
>> massive
>> > queriesIt's possible to do the same using TableInputFormat with
>> Spark and
>> > it's more efficient than use scan with filters and so on (again)
>> when you
>> > want to do a massive query about all the table. Am I right?
>> >
>>
>


Re: Reading the whole table with MapReduce and Spark.

2019-05-28 Thread Guillermo Ortiz Fernández
It depends on the row; they only share about 5% of the qualifier names. Each
row could have about 500-3000 columns across 3 column families. One of them has
80% of the columns.

The table has around 75M rows.

El mar., 28 may. 2019 a las 17:33,  escribió:

> Guillermo
>
>
> How large is your table?   How many columns?
>
>
> Sincerely,
>
> Sean
>
> > On May 28, 2019 at 10:11 AM Guillermo Ortiz  mailto:konstt2...@gmail.com > wrote:
> >
> >
> > I have a doubt. When you process a Hbase table with MapReduce you
> could use
> > the TableInputFormat, I understand that it goes directly to HDFS
> files
> > (storesFiles in HDFS) , so you could do some filter in the map phase
> and
> > it's not the same to go through to the region servers to do some
> massive
> > queriesIt's possible to do the same using TableInputFormat with
> Spark and
> > it's more efficient than use scan with filters and so on (again)
> when you
> > want to do a massive query about all the table. Am I right?
> >
>


Reading the whole table with MapReduce and Spark.

2019-05-28 Thread Guillermo Ortiz
I have a doubt. When you process an HBase table with MapReduce you can use
TableInputFormat. I understand that it goes directly to the HDFS files
(store files in HDFS), so you can do some filtering in the map phase, and
that is not the same as going through the region servers to do some massive
queries. Is it possible to do the same using TableInputFormat with Spark, and
is it more efficient than using a scan with filters and so on (again) when you
want to do a massive query over the whole table? Am I right?


Re: table.put(singlePut) vs table.put(ListPut)

2019-05-10 Thread Guillermo Ortiz
I thought that but I saw this
https://stackoverflow.com/questions/28754077/is-hbase-batch-put-putlistput-faster-than-putput-what-is-the-capacity-of

El jue., 9 may. 2019 19:08, Andor Molnar 
escribió:

> Hi Guillermo,
>
> I'm not sure which version of HBase you're referring to, but it looks like
> the API docs are quite clear about these 2 calls:
> "Puts some data in the table." vs "Batch puts the specified data into the
> table."
>
> Essentially you can pass multiple Put commands to the second call
> (listPuts) and it uses the batch API, so it's probably more efficient for
> bulk load scenarios.
>
> Regards,
> Andor
>
>
>
> On Thu, May 9, 2019 at 4:05 PM Guillermo Ortiz 
> wrote:
>
> > I have seen that there are two methods to make a put, I would like to
> know
> > if there's any different between call table.put(put) or
> table.put(listPuts)
> > or under the hood is the same?
> >
> > My use case is to put many records with spark (kind of bulk load)
> >
>


Re: Max Number of versions in HBase

2019-05-09 Thread Guillermo Ortiz Fernández
ok

El mar., 7 may. 2019 a las 20:25, Ankit Singhal ()
escribió:

> HBase versions generally impact the size of the store file (and could
> impact performance while scanning), if you are in following limits and know
> what you are doing, you should be fine
> 1. Single Row size not exceeding your region size
> 2. Max versions are in hundreds
> 3. And you have HBASE-11544 in your HBase version.
>
> Regards,
> Ankit Singhal
>
> On Tue, May 7, 2019 at 5:46 AM Guillermo Ortiz Fernández <
> guillermo.ortiz.f...@gmail.com> wrote:
>
> > Hello,
> >
> > I'm thinking about the design of one hbase table and one approximation
> has
> > two CF and some of the columns could have hundreds of versions and others
> > columns in the same CF only a few values.
> >
> > If I think how HBase saves data in storefiles it seems that there aren't
> > problems about this, but I would like to ask about,, any problem about
> save
> > hundreds of versions? something to keep in mind?
> >
>


table.put(singlePut) vs table.put(ListPut)

2019-05-09 Thread Guillermo Ortiz
I have seen that there are two methods to do a put. I would like to know
whether there is any difference between calling table.put(put) and
table.put(listPuts), or whether under the hood they are the same.

My use case is to put many records with Spark (a kind of bulk load).
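
A minimal sketch of the two calls side by side, assuming the 1.x-style Table
API; the table, family and qualifier names are placeholders. The list variant
goes through the client's batch path, which groups the Puts by region server,
so it normally saves round trips for bulk-style writes:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutComparison {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("t"))) {

            // Single put: one client call per row.
            Put single = new Put(Bytes.toBytes("row-0"));
            single.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v0"));
            table.put(single);

            // List put: the client sends the whole batch, grouped by region server.
            List<Put> puts = new ArrayList<>();
            for (int i = 1; i <= 1000; i++) {
                Put p = new Put(Bytes.toBytes("row-" + i));
                p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
                puts.add(p);
            }
            table.put(puts);
        }
    }
}

For a continuous stream of writes, BufferedMutator (obtained with
conn.getBufferedMutator(tableName)) is the other option the client API offers;
it buffers mutations and flushes them in larger batches in the background.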


Max Number of versions in HBase

2019-05-07 Thread Guillermo Ortiz Fernández
Hello,

I'm thinking about the design of an HBase table. One approach has two CFs;
some of the columns could have hundreds of versions, while other columns in the
same CF have only a few values.

Thinking about how HBase saves data in store files, it seems there is no
problem with this, but I would like to ask: is there any problem with saving
hundreds of versions? Anything to keep in mind?
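
One detail worth keeping in mind for the design: the versions limit is a
column-family-level setting, not a per-column one, so the family that needs
hundreds of versions and the family that needs only a few are configured
separately. A minimal sketch, using the pre-2.0 HTableDescriptor /
HColumnDescriptor API; the table and family names are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateVersionedTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events"));

            HColumnDescriptor history = new HColumnDescriptor("history");
            history.setMaxVersions(500);   // family whose cells keep hundreds of versions

            HColumnDescriptor meta = new HColumnDescriptor("meta");
            meta.setMaxVersions(3);        // family where only a few versions matter

            desc.addFamily(history);
            desc.addFamily(meta);
            admin.createTable(desc);
        }
    }
}

MAX_VERSIONS is only an upper bound, so columns that are rarely rewritten
simply keep fewer versions; the cost shows up mainly in store-file size and in
scan work for the heavily versioned cells.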


Re: Number of tables in HBase.

2016-01-18 Thread Guillermo Ortiz
Sorry, I meant just a few column families, 1 to 3. But I don't know whether
it's a good idea to have too many tables in HBase.

2016-01-18 11:15 GMT+01:00 Guillermo Ortiz :

> Hello,
>
> I know that a table should have a content number of CFs. How about the
> number of tables in the same HBase cluster? it should be okay to have
> dozens of tables or it's thought to have just a few number of tables?
>


Number of tables in HBase.

2016-01-18 Thread Guillermo Ortiz
Hello,

I know that a table should have a limited number of CFs. What about the
number of tables in the same HBase cluster? Is it okay to have dozens of
tables, or is it intended to have just a few?


Re: Tool to execute a benchmark for HBase.

2015-01-30 Thread Guillermo Ortiz
I have come back to the benchmark. I executed this command:
ycsb run hbase -P workflowA -p columnfamily=cf -p
operationcount=10 threads=32

And I got a performance of 2000 op/sec.
What I did later was to execute ten of those commands in parallel, and
I got about 18000 op/sec in total. I don't get 2000 op/sec for each of
the executions, but I got about 1800 op/sec.

I don't know if it's an HBase question, but I don't understand why I
get more performance when I execute more commands in parallel if I
already execute 32 threads.
I took a look at top and saw that with the first run (just one
process) the CPU was working at about 20-60%; when I launch more processes
the CPU is at about 400-500%.



2015-01-29 18:23 GMT+01:00 Guillermo Ortiz :
> There's an option when you execute yscb to say how many clients
> threads you want to use. I tried with 1/8/16/32. Those results are
> with 16, the improvement 1vs8 it's pretty high not as much 16 to 32.
> I only use one yscb, could it be that important?
>
> -threads : the number of client threads. By default, the YCSB Client
> uses a single worker thread, but additional threads can be specified.
> This is often done to increase the amount of load offered against the
> database.
>
> 2015-01-29 17:27 GMT+01:00 Nishanth S :
>> How many instances of ycsb do you run and how many threads do you use per
>> instance.I guess these ops are per instance and  you should get similar
>> numbers if you run  more instances.In short try running more  workload
>> instances...
>>
>> -Nishanth
>>
>> On Thu, Jan 29, 2015 at 8:49 AM, Guillermo Ortiz 
>> wrote:
>>
>>> Yes, I'm using 40%. i can't access to those data either.
>>> I don't know how YSCB executes the reads and if they are random and
>>> could take advange of the cache.
>>>
>>> Do you think that it's an acceptable performance?
>>>
>>>
>>> 2015-01-29 16:26 GMT+01:00 Ted Yu :
>>> > What's the value for hfile.block.cache.size ?
>>> >
>>> > By default it is 40%. You may want to increase its value if you're using
>>> > default.
>>> >
>>> > Andrew published some ycsb results :
>>> > http://people.apache.org/~apurtell/results-ycsb-0.98.8/ycsb
>>> > -0.98.0-vs-0.98.8.pdf
>>> >
>>> > However, I couldn't access the above now.
>>> >
>>> > Cheers
>>> >
>>> > On Thu, Jan 29, 2015 at 7:14 AM, Guillermo Ortiz 
>>> > wrote:
>>> >
>>> >> Is there any result with that benchmark to compare??
>>> >> I'm executing the different workloads and for example for 100% Reads
>>> >> in a table with 10Millions of records I only get an performance of
>>> >> 2000operations/sec. I hoped much better performance but I could be
>>> >> wrong. I'd like to know if it's a normal performance or I could have
>>> >> something bad configured.
>>> >>
>>> >>
>>> >> I have splitted the tabled and all the records are balanced and used
>>> >> snappy.
>>> >> The cluster has a master and 4 regions servers with 256Gb,Cores 2 (32
>>> >> w/ Hyperthreading), 0.98.6-cdh5.3.0,
>>> >>
>>> >> RegionServer is executed with these parameters:
>>> >>  /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver
>>> >> -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m
>>> >> -Djava.net.preferIPv4Stack=true -Xms640679936 -Xmx640679936
>>> >> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
>>> >> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
>>> >> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
>>> >> -Dhbase.log.dir=/var/log/hbase
>>> >>
>>> >>
>>> -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-cnsalbsrvcl23.lvtc.gsnet.corp.log.out
>>> >>
>>> -Dhbase.home.dir=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase
>>> >> -Dhbase.id.str= -Dhbase.root.logger=INFO,RFA
>>> >>
>>> >>
>>> -Djava.library.path=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/lib/native
>>> >> -Dhbase.security.logger=INFO,RFAS
>>> >> org.apache.hadoop.hbase.regionserver.HRegionServer start
>>> >>
>>> >>
>>> >> The results for 100% reads are
>>> >> [OVERALL], RunTime(ms), 42734.0
>>> &

Re: Tool to execute a benchmark for HBase.

2015-01-29 Thread Guillermo Ortiz
There's an option when you execute ycsb to say how many client
threads you want to use. I tried with 1/8/16/32. Those results are
with 16; the improvement from 1 to 8 is pretty high, not as much from 16 to 32.
I only use one ycsb instance, could that be important?

-threads : the number of client threads. By default, the YCSB Client
uses a single worker thread, but additional threads can be specified.
This is often done to increase the amount of load offered against the
database.

2015-01-29 17:27 GMT+01:00 Nishanth S :
> How many instances of ycsb do you run and how many threads do you use per
> instance.I guess these ops are per instance and  you should get similar
> numbers if you run  more instances.In short try running more  workload
> instances...
>
> -Nishanth
>
> On Thu, Jan 29, 2015 at 8:49 AM, Guillermo Ortiz 
> wrote:
>
>> Yes, I'm using 40%. i can't access to those data either.
>> I don't know how YSCB executes the reads and if they are random and
>> could take advange of the cache.
>>
>> Do you think that it's an acceptable performance?
>>
>>
>> 2015-01-29 16:26 GMT+01:00 Ted Yu :
>> > What's the value for hfile.block.cache.size ?
>> >
>> > By default it is 40%. You may want to increase its value if you're using
>> > default.
>> >
>> > Andrew published some ycsb results :
>> > http://people.apache.org/~apurtell/results-ycsb-0.98.8/ycsb
>> > -0.98.0-vs-0.98.8.pdf
>> >
>> > However, I couldn't access the above now.
>> >
>> > Cheers
>> >
>> > On Thu, Jan 29, 2015 at 7:14 AM, Guillermo Ortiz 
>> > wrote:
>> >
>> >> Is there any result with that benchmark to compare??
>> >> I'm executing the different workloads and for example for 100% Reads
>> >> in a table with 10Millions of records I only get an performance of
>> >> 2000operations/sec. I hoped much better performance but I could be
>> >> wrong. I'd like to know if it's a normal performance or I could have
>> >> something bad configured.
>> >>
>> >>
>> >> I have splitted the tabled and all the records are balanced and used
>> >> snappy.
>> >> The cluster has a master and 4 regions servers with 256Gb,Cores 2 (32
>> >> w/ Hyperthreading), 0.98.6-cdh5.3.0,
>> >>
>> >> RegionServer is executed with these parameters:
>> >>  /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver
>> >> -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m
>> >> -Djava.net.preferIPv4Stack=true -Xms640679936 -Xmx640679936
>> >> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
>> >> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
>> >> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
>> >> -Dhbase.log.dir=/var/log/hbase
>> >>
>> >>
>> -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-cnsalbsrvcl23.lvtc.gsnet.corp.log.out
>> >>
>> -Dhbase.home.dir=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase
>> >> -Dhbase.id.str= -Dhbase.root.logger=INFO,RFA
>> >>
>> >>
>> -Djava.library.path=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/lib/native
>> >> -Dhbase.security.logger=INFO,RFAS
>> >> org.apache.hadoop.hbase.regionserver.HRegionServer start
>> >>
>> >>
>> >> The results for 100% reads are
>> >> [OVERALL], RunTime(ms), 42734.0
>> >> [OVERALL], Throughput(ops/sec), 2340.0570973931763
>> >> [UPDATE], Operations, 1.0
>> >> [UPDATE], AverageLatency(us), 103170.0
>> >> [UPDATE], MinLatency(us), 103168.0
>> >> [UPDATE], MaxLatency(us), 103171.0
>> >> [UPDATE], 95thPercentileLatency(ms), 103.0
>> >> [UPDATE], 99thPercentileLatency(ms), 103.0
>> >> [READ], Operations, 10.0
>> >> [READ], AverageLatency(us), 412.5534
>> >> [READ], AverageLatency(us,corrected), 581.6249026771276
>> >> [READ], MinLatency(us), 218.0
>> >> [READ], MaxLatency(us), 268383.0
>> >> [READ], MaxLatency(us,corrected), 268383.0
>> >> [READ], 95thPercentileLatency(ms), 0.0
>> >> [READ], 95thPercentileLatency(ms,corrected), 0.0
>> >> [READ], 99thPercentileLatency(ms), 0.0
>> >> [READ], 99thPercentileLatency(ms,corrected), 0.0
>> >> [READ], Return=0, 10
>> >> [CLEANUP], Operations, 1.0
>> >> [CLEANUP], AverageLatency(us), 103598.0
>

Re: Tool to execute a benchmark for HBase.

2015-01-29 Thread Guillermo Ortiz
Yes, I'm using 40%. I can't access those data either.
I don't know how YCSB executes the reads, whether they are random and
could take advantage of the cache.

Do you think it's an acceptable performance?


2015-01-29 16:26 GMT+01:00 Ted Yu :
> What's the value for hfile.block.cache.size ?
>
> By default it is 40%. You may want to increase its value if you're using
> default.
>
> Andrew published some ycsb results :
> http://people.apache.org/~apurtell/results-ycsb-0.98.8/ycsb
> -0.98.0-vs-0.98.8.pdf
>
> However, I couldn't access the above now.
>
> Cheers
>
> On Thu, Jan 29, 2015 at 7:14 AM, Guillermo Ortiz 
> wrote:
>
>> Is there any result with that benchmark to compare??
>> I'm executing the different workloads and for example for 100% Reads
>> in a table with 10Millions of records I only get an performance of
>> 2000operations/sec. I hoped much better performance but I could be
>> wrong. I'd like to know if it's a normal performance or I could have
>> something bad configured.
>>
>>
>> I have splitted the tabled and all the records are balanced and used
>> snappy.
>> The cluster has a master and 4 regions servers with 256Gb,Cores 2 (32
>> w/ Hyperthreading), 0.98.6-cdh5.3.0,
>>
>> RegionServer is executed with these parameters:
>>  /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver
>> -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m
>> -Djava.net.preferIPv4Stack=true -Xms640679936 -Xmx640679936
>> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
>> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
>> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
>> -Dhbase.log.dir=/var/log/hbase
>>
>> -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-cnsalbsrvcl23.lvtc.gsnet.corp.log.out
>> -Dhbase.home.dir=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase
>> -Dhbase.id.str= -Dhbase.root.logger=INFO,RFA
>>
>> -Djava.library.path=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/lib/native
>> -Dhbase.security.logger=INFO,RFAS
>> org.apache.hadoop.hbase.regionserver.HRegionServer start
>>
>>
>> The results for 100% reads are
>> [OVERALL], RunTime(ms), 42734.0
>> [OVERALL], Throughput(ops/sec), 2340.0570973931763
>> [UPDATE], Operations, 1.0
>> [UPDATE], AverageLatency(us), 103170.0
>> [UPDATE], MinLatency(us), 103168.0
>> [UPDATE], MaxLatency(us), 103171.0
>> [UPDATE], 95thPercentileLatency(ms), 103.0
>> [UPDATE], 99thPercentileLatency(ms), 103.0
>> [READ], Operations, 10.0
>> [READ], AverageLatency(us), 412.5534
>> [READ], AverageLatency(us,corrected), 581.6249026771276
>> [READ], MinLatency(us), 218.0
>> [READ], MaxLatency(us), 268383.0
>> [READ], MaxLatency(us,corrected), 268383.0
>> [READ], 95thPercentileLatency(ms), 0.0
>> [READ], 95thPercentileLatency(ms,corrected), 0.0
>> [READ], 99thPercentileLatency(ms), 0.0
>> [READ], 99thPercentileLatency(ms,corrected), 0.0
>> [READ], Return=0, 10
>> [CLEANUP], Operations, 1.0
>> [CLEANUP], AverageLatency(us), 103598.0
>> [CLEANUP], MinLatency(us), 103596.0
>> [CLEANUP], MaxLatency(us), 103599.0
>> [CLEANUP], 95thPercentileLatency(ms), 103.0
>> [CLEANUP], 99thPercentileLatency(ms), 103.0
>>
>> hbase(main):030:0> describe 'username'
>> DESCRIPTION
>> ENABLED
>>  'username', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER
>> => 'ROW', REPLICATION_SCOPE => '0', true
>>   VERSIONS => '1', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', TTL
>> => 'FOREVER', KEEP_DELETED_CELLS => '
>>  false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
>> 1 row(s) in 0.0170 seconds
>>
>> 2015-01-29 5:27 GMT+01:00 Ted Yu :
>> > Maybe ask on Cassandra mailing list for the benchmark tool they use ?
>> >
>> > Cheers
>> >
>> > On Wed, Jan 28, 2015 at 1:23 PM, Guillermo Ortiz 
>> > wrote:
>> >
>> >> I was checking that web, do you know if there's another possibility
>> >> since last updated for Cassandra was two years ago and I'd like to
>> >> compare bothof them with kind of same tool/code.
>> >>
>> >> 2015-01-28 22:10 GMT+01:00 Ted Yu :
>> >> > Guillermo:
>> >> > If you use hbase 0.98.x, please consid

Re: Tool to execute a benchmark for HBase.

2015-01-29 Thread Guillermo Ortiz
Is there any result with that benchmark to compare against?
I'm executing the different workloads and, for example, for 100% reads
on a table with 10 million records I only get a performance of
2000 operations/sec. I hoped for much better performance, but I could be
wrong. I'd like to know whether that is a normal performance or whether I
could have something badly configured.


I have split the table, all the records are balanced, and I use Snappy.
The cluster has a master and 4 region servers with 256 GB, 2 CPUs (32
w/ Hyperthreading), 0.98.6-cdh5.3.0.

RegionServer is executed with these parameters:
 /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver
-XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m
-Djava.net.preferIPv4Stack=true -Xms640679936 -Xmx640679936
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
-XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
-XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
-Dhbase.log.dir=/var/log/hbase
-Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-cnsalbsrvcl23.lvtc.gsnet.corp.log.out
-Dhbase.home.dir=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase
-Dhbase.id.str= -Dhbase.root.logger=INFO,RFA
-Djava.library.path=/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/lib/native
-Dhbase.security.logger=INFO,RFAS
org.apache.hadoop.hbase.regionserver.HRegionServer start


The results for 100% reads are
[OVERALL], RunTime(ms), 42734.0
[OVERALL], Throughput(ops/sec), 2340.0570973931763
[UPDATE], Operations, 1.0
[UPDATE], AverageLatency(us), 103170.0
[UPDATE], MinLatency(us), 103168.0
[UPDATE], MaxLatency(us), 103171.0
[UPDATE], 95thPercentileLatency(ms), 103.0
[UPDATE], 99thPercentileLatency(ms), 103.0
[READ], Operations, 10.0
[READ], AverageLatency(us), 412.5534
[READ], AverageLatency(us,corrected), 581.6249026771276
[READ], MinLatency(us), 218.0
[READ], MaxLatency(us), 268383.0
[READ], MaxLatency(us,corrected), 268383.0
[READ], 95thPercentileLatency(ms), 0.0
[READ], 95thPercentileLatency(ms,corrected), 0.0
[READ], 99thPercentileLatency(ms), 0.0
[READ], 99thPercentileLatency(ms,corrected), 0.0
[READ], Return=0, 10
[CLEANUP], Operations, 1.0
[CLEANUP], AverageLatency(us), 103598.0
[CLEANUP], MinLatency(us), 103596.0
[CLEANUP], MaxLatency(us), 103599.0
[CLEANUP], 95thPercentileLatency(ms), 103.0
[CLEANUP], 99thPercentileLatency(ms), 103.0

hbase(main):030:0> describe 'username'
DESCRIPTION
ENABLED
 'username', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER
=> 'ROW', REPLICATION_SCOPE => '0', true
  VERSIONS => '1', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', TTL
=> 'FOREVER', KEEP_DELETED_CELLS => '
 false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.0170 seconds

2015-01-29 5:27 GMT+01:00 Ted Yu :
> Maybe ask on Cassandra mailing list for the benchmark tool they use ?
>
> Cheers
>
> On Wed, Jan 28, 2015 at 1:23 PM, Guillermo Ortiz 
> wrote:
>
>> I was checking that web, do you know if there's another possibility
>> since last updated for Cassandra was two years ago and I'd like to
>> compare bothof them with kind of same tool/code.
>>
>> 2015-01-28 22:10 GMT+01:00 Ted Yu :
>> > Guillermo:
>> > If you use hbase 0.98.x, please consider Andrew's ycsb repo:
>> >
>> > https://github.com/apurtell/ycsb/tree/new_hbase_client
>> >
>> > Cheers
>> >
>> > On Wed, Jan 28, 2015 at 12:41 PM, Nishanth S 
>> > wrote:
>> >
>> >> You can use ycsb for this purpose.See here
>> >>
>> >> https://github.com/brianfrankcooper/YCSB/wiki/Getting-Started
>> >> -Nishanth
>> >>
>> >> On Wed, Jan 28, 2015 at 1:37 PM, Guillermo Ortiz 
>> >> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > I'd like to do some benchmarks fo HBase but I don't know what tool
>> >> > could use. I started to make some code but I guess that there're some
>> >> > easier.
>> >> >
>> >> > I've taken a look to JMeter, but I guess that I'd attack directly from
>> >> > Java, JMeter looks great but I don't know if it fits well in this
>> >> > scenario. What tool could I use to take some measures as time to
>> >> > response some read and write request, etc. I'd like that to be able to
>> >> > make the same benchmarks to Cassandra.
>> >> >
>> >>
>>


Re: Tool to execute a benchmark for HBase.

2015-01-28 Thread Guillermo Ortiz
I was checking that page. Do you know if there's another possibility? The
last update for Cassandra was two years ago and I'd like to compare both of
them with roughly the same tool/code.

2015-01-28 22:10 GMT+01:00 Ted Yu :
> Guillermo:
> If you use hbase 0.98.x, please consider Andrew's ycsb repo:
>
> https://github.com/apurtell/ycsb/tree/new_hbase_client
>
> Cheers
>
> On Wed, Jan 28, 2015 at 12:41 PM, Nishanth S 
> wrote:
>
>> You can use ycsb for this purpose.See here
>>
>> https://github.com/brianfrankcooper/YCSB/wiki/Getting-Started
>> -Nishanth
>>
>> On Wed, Jan 28, 2015 at 1:37 PM, Guillermo Ortiz 
>> wrote:
>>
>> > Hi,
>> >
>> > I'd like to do some benchmarks fo HBase but I don't know what tool
>> > could use. I started to make some code but I guess that there're some
>> > easier.
>> >
>> > I've taken a look to JMeter, but I guess that I'd attack directly from
>> > Java, JMeter looks great but I don't know if it fits well in this
>> > scenario. What tool could I use to take some measures as time to
>> > response some read and write request, etc. I'd like that to be able to
>> > make the same benchmarks to Cassandra.
>> >
>>


Tool to execute a benchmark for HBase.

2015-01-28 Thread Guillermo Ortiz
Hi,

I'd like to do some benchmarks of HBase but I don't know which tool I
could use. I started to write some code but I guess there are easier options.

I've taken a look at JMeter, but I guess I'd attack directly from
Java; JMeter looks great but I don't know whether it fits well in this
scenario. Which tool could I use to take measures such as the time to
respond to read and write requests, etc.? I'd like to be able to
run the same benchmarks against Cassandra.


Paint dashboards from HBase

2014-11-24 Thread Guillermo Ortiz
Is there any tool to draw data from HBase on a dashboard, like Kibana?
I have been looking, but I haven't found a tool that fits directly
with HBase for that purpose.


Hbase with Phoenix, when?

2014-11-22 Thread Guillermo Ortiz
I just read about the Phoenix project. People from Phoenix speak very well
of it; the question is, they give you SQL and it's supposedly pretty
fast. Is there any case where it is better to use just HBase without Phoenix?
Is it in general faster than executing native scans and writing your own
coprocessors?


Re: Scan vs Parallel scan.

2014-09-16 Thread Guillermo Ortiz
I attach the code that I'm executing. I don't have access to the generator
for HBase.
In the last benchmark, the simple scan takes about 4 times less than this
version.

With that version it is only possible to do complete scans.
I have been trying a complete scan of an HTable with 100,000 rows and it
takes less than one second; is that not too fast?




2014-09-14 20:21 GMT+02:00 Guillermo Ortiz :

> I don't have the code here. But I created a class RegionScanner, this
> class does a complete scan of a region. So I have to set the start and stop
> keys. the start and stop key are the limits of that region.
>
> El domingo, 14 de septiembre de 2014, Anoop John 
> escribió:
>
> Again full code snippet can better speak.
>>
>> But not getting what u r doing with below code
>>
>> private List generatePartitions() {
>> List regionScanners = new
>> ArrayList();
>> byte[] startKey;
>> byte[] stopKey;
>> HConnection connection = null;
>> HBaseAdmin hbaseAdmin = null;
>> try {
>> connection = HConnectionManager.
>> createConnection(HBaseConfiguration.create());
>> hbaseAdmin = new HBaseAdmin(connection);
>> List regions =
>> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>> RegionScanner regionScanner = null;
>> for (HRegionInfo region : regions) {
>>
>> startKey = region.getStartKey();
>> stopKey = region.getEndKey();
>>
>> regionScanner = new RegionScanner(startKey, stopKey,
>> scanConfiguration);
>> // regionScanner = createRegionScanner(startKey, stopKey);
>> if (regionScanner != null) {
>> regionScanners.add(regionScanner);
>> }
>> }
>>
>> And I execute the RegionScanner with this:
>> public List call() throws Exception {
>> HConnection connection =
>> HConnectionManager.
>> createConnection(HBaseConfiguration.create());
>> HTableInterface table =
>> connection.getTable(configuration.getTable());
>>
>> Scan scan = new Scan(startKey, stopKey);
>> scan.setBatch(configuration.getBatch());
>> scan.setCaching(configuration.getCaching());
>> ResultScanner resultScanner = table.getScanner(scan);
>>
>>
>> What is this part?
>> new RegionScanner(startKey, stopKey,
>> scanConfiguration);
>>
>>
>> >>Scan scan = new Scan(startKey, stopKey);
>> scan.setBatch(configuration.
>> getBatch());
>> scan.setCaching(configuration.getCaching());
>> ResultScanner resultScanner = table.getScanner(scan);
>>
>>
>> And not setting start and stop rows to this Scan object? !!
>>
>>
>> Sorry If I missed some parts from ur code.
>>
>> -Anoop-
>>
>>
>> On Sun, Sep 14, 2014 at 2:54 PM, Guillermo Ortiz 
>> wrote:
>>
>> > I don't have the code here,, but I'll put the code in a couple of days.
>> I
>> > have to check the executeservice again! I don't remember exactly how I
>> did.
>> >
>> > I'm using Hbase 0.98.
>> >
>> > El domingo, 14 de septiembre de 2014, lars hofhansl 
>> > escribió:
>> >
>> > > What specific version of 0.94 are you using?
>> > >
>> > > In general, if you have multiple spindles (disks) and/or multiple CPU
>> > > cores at the region server you should benefits from keeping multiple
>> > region
>> > > server handler threads busy. I have experimented with this before and
>> > saw a
>> > > close to linear speed up (up to the point where all disks/core were
>> > busy).
>> > > Obviously this also assuming this is the only load you throw at the
>> > servers
>> > > at this point.
>> > >
>> > > Can you post your complete code to pastebin? Maybe even with some
>> code to
>> > > seed the data?
>> > > How do you run your callables? Did you configure the ExecuteService
>> > > correctly (assuming you use one to run your callables)?
>> > >
>> > > Then we can run it and have a look.
>> > >
>> > > Thanks.
>> > >
>> > > -- Lars
>> > >
>> > >
>> > > - Original Message -
>> > > From: Guillermo Ortiz >
>> > > To: "user@hbase.apache.org " > > > &g

Re: Scan vs Parallel scan.

2014-09-14 Thread Guillermo Ortiz
I don't have the code here. But I created a class RegionScanner; this class
does a complete scan of a region, so I have to set the start and stop keys.
The start and stop keys are the limits of that region.

El domingo, 14 de septiembre de 2014, Anoop John 
escribió:

> Again full code snippet can better speak.
>
> But not getting what u r doing with below code
>
> private List generatePartitions() {
> List regionScanners = new
> ArrayList();
> byte[] startKey;
> byte[] stopKey;
> HConnection connection = null;
> HBaseAdmin hbaseAdmin = null;
> try {
> connection = HConnectionManager.
> createConnection(HBaseConfiguration.create());
> hbaseAdmin = new HBaseAdmin(connection);
> List regions =
> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> RegionScanner regionScanner = null;
> for (HRegionInfo region : regions) {
>
> startKey = region.getStartKey();
> stopKey = region.getEndKey();
>
> regionScanner = new RegionScanner(startKey, stopKey,
> scanConfiguration);
> // regionScanner = createRegionScanner(startKey, stopKey);
> if (regionScanner != null) {
> regionScanners.add(regionScanner);
> }
> }
>
> And I execute the RegionScanner with this:
> public List call() throws Exception {
> HConnection connection =
> HConnectionManager.
> createConnection(HBaseConfiguration.create());
> HTableInterface table =
> connection.getTable(configuration.getTable());
>
> Scan scan = new Scan(startKey, stopKey);
> scan.setBatch(configuration.getBatch());
> scan.setCaching(configuration.getCaching());
> ResultScanner resultScanner = table.getScanner(scan);
>
>
> What is this part?
> new RegionScanner(startKey, stopKey,
> scanConfiguration);
>
>
> >>Scan scan = new Scan(startKey, stopKey);
> scan.setBatch(configuration.
> getBatch());
> scan.setCaching(configuration.getCaching());
> ResultScanner resultScanner = table.getScanner(scan);
>
>
> And not setting start and stop rows to this Scan object? !!
>
>
> Sorry If I missed some parts from ur code.
>
> -Anoop-
>
>
> On Sun, Sep 14, 2014 at 2:54 PM, Guillermo Ortiz  >
> wrote:
>
> > I don't have the code here,, but I'll put the code in a couple of days. I
> > have to check the executeservice again! I don't remember exactly how I
> did.
> >
> > I'm using Hbase 0.98.
> >
> > El domingo, 14 de septiembre de 2014, lars hofhansl  >
> > escribió:
> >
> > > What specific version of 0.94 are you using?
> > >
> > > In general, if you have multiple spindles (disks) and/or multiple CPU
> > > cores at the region server you should benefits from keeping multiple
> > region
> > > server handler threads busy. I have experimented with this before and
> > saw a
> > > close to linear speed up (up to the point where all disks/core were
> > busy).
> > > Obviously this also assuming this is the only load you throw at the
> > servers
> > > at this point.
> > >
> > > Can you post your complete code to pastebin? Maybe even with some code
> to
> > > seed the data?
> > > How do you run your callables? Did you configure the ExecuteService
> > > correctly (assuming you use one to run your callables)?
> > >
> > > Then we can run it and have a look.
> > >
> > > Thanks.
> > >
> > > -- Lars
> > >
> > >
> > > - Original Message -
> > > From: Guillermo Ortiz 
> >
> > > To: "user@hbase.apache.org  " <
> user@hbase.apache.org 
> > > >
> > > Cc:
> > > Sent: Saturday, September 13, 2014 4:49 PM
> > > Subject: Re: Scan vs Parallel scan.
> > >
> > > What am I missing??
> > >
> > >
> > >
> > >
> > > 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz  
> > > >:
> > >
> > > > For an partial scan, I guess that I call to the RS to get data, it
> > starts
> > > > looking in the store files and recollecting the data. (It doesn't
> write
> > > to
> > > > the blockcache in both cases). It has ready the data and it gives to
> > the
> > > > client the data step by step, I mean,,, it depends the caching and
> > > batching
> > > > parameters.
> > &

Re: Scan vs Parallel scan.

2014-09-14 Thread Guillermo Ortiz
I don't have the code here, but I'll post it in a couple of days. I
have to check the ExecutorService again! I don't remember exactly how I did it.

I'm using HBase 0.98.
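
In the meantime, a minimal sketch of how the RegionScanner callables from the
quoted snippets are typically driven; the pool sizing and the join step here
are assumptions about the missing ExecutorService code, not the actual
implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.hbase.client.Result;

// RegionScanner and generatePartitions() are the user-defined pieces shown in the quoted code.
public List<Result> scanInParallel() throws Exception {
    List<RegionScanner> scanners = generatePartitions();               // one Callable<List<Result>> per region
    ExecutorService pool = Executors.newFixedThreadPool(scanners.size());
    try {
        List<Future<List<Result>>> futures = pool.invokeAll(scanners); // run every partial scan concurrently
        List<Result> all = new ArrayList<Result>();
        for (Future<List<Result>> f : futures) {
            all.addAll(f.get());                                       // any scan failure surfaces here
        }
        return all;
    } finally {
        pool.shutdown();
    }
}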

El domingo, 14 de septiembre de 2014, lars hofhansl 
escribió:

> What specific version of 0.94 are you using?
>
> In general, if you have multiple spindles (disks) and/or multiple CPU
> cores at the region server you should benefits from keeping multiple region
> server handler threads busy. I have experimented with this before and saw a
> close to linear speed up (up to the point where all disks/core were busy).
> Obviously this also assuming this is the only load you throw at the servers
> at this point.
>
> Can you post your complete code to pastebin? Maybe even with some code to
> seed the data?
> How do you run your callables? Did you configure the ExecuteService
> correctly (assuming you use one to run your callables)?
>
> Then we can run it and have a look.
>
> Thanks.
>
> -- Lars
>
>
> - Original Message -
> From: Guillermo Ortiz >
> To: "user@hbase.apache.org "  >
> Cc:
> Sent: Saturday, September 13, 2014 4:49 PM
> Subject: Re: Scan vs Parallel scan.
>
> What am I missing??
>
>
>
>
> 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz  >:
>
> > For an partial scan, I guess that I call to the RS to get data, it starts
> > looking in the store files and recollecting the data. (It doesn't write
> to
> > the blockcache in both cases). It has ready the data and it gives to the
> > client the data step by step, I mean,,, it depends the caching and
> batching
> > parameters.
> >
> > Big differences that I see...
> > I'm opening more connections to the Table, one for Region.
> >
> > I should check the single table scan, it looks like it does partial scans
> > sequentially. Since you can see on the HBase Master how the request
> > increase one after another, not all in the same time.
> >
> > 2014-09-12 15:23 GMT+02:00 Michael Segel  >:
> >
> >> It doesn’t matter which RS, but that you have 1 thread for each region.
> >>
> >> So for each thread, what’s happening.
> >> Step by step, what is the code doing.
> >>
> >> Now you’re comparing this against a single table scan, right?
> >> What’s happening in the table scan…?
> >>
> >>
> >> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz  >
> >> wrote:
> >>
> >> > Right, My table for example has keys between 0-9. in three regions
> >> > 0-2,3-7,7-9
> >> > I lauch three partial scans in parallel. The scans that I'm executing
> >> are:
> >> > scan(0,2), scan(3,7), scan(7,9).
> >> > Each region is if a different RS, so each thread goes to different RS.
> >> It's
> >> > not exactly like that, but on the benchmark case it's like it's
> working.
> >> >
> >> > Really the code will execute a thread for each Region not for each
> >> > RegionServer. But in the test I only have two regions for
> regionServer.
> >> I
> >> > dont' think that's an important point, there're two threads for RS.
> >> >
> >> > 2014-09-12 14:48 GMT+02:00 Michael Segel  >:
> >> >
> >> >> Ok, lets again take a step back…
> >> >>
> >> >> So you are comparing your partial scan(s) against a full table scan?
> >> >>
> >> >> If I understood your question, you launch 3 partial scans where you
> set
> >> >> the start row and then end row of each scan, right?
> >> >>
> >> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz  >
> >> wrote:
> >> >>
> >> >>> Okay, then, the partial scan doesn't work as I think.
> >> >>> How could it exceed the limit of a single region if I calculate the
> >> >> limits?
> >> >>>
> >> >>>
> >> >>> The only bad point that I see it's that If a region server has three
> >> >>> regions of the same table,  I'm executing three partial scans about
> >> this
> >> >> RS
> >> >>> and they could compete for resources (network, etc..) on this node.
> >> It'd
> >> >> be
> >> >>> better to have one thread for RS. But, that doesn't answer your
> >> >> questions.
> >> >>>
> >> >>> I keep thinking...
> >> >>>

Re: Scan vs Parallel scan.

2014-09-13 Thread Guillermo Ortiz
What am I missing??

2014-09-12 16:05 GMT+02:00 Guillermo Ortiz :

> For an partial scan, I guess that I call to the RS to get data, it starts
> looking in the store files and recollecting the data. (It doesn't write to
> the blockcache in both cases). It has ready the data and it gives to the
> client the data step by step, I mean,,, it depends the caching and batching
> parameters.
>
> Big differences that I see...
> I'm opening more connections to the Table, one for Region.
>
> I should check the single table scan, it looks like it does partial scans
> sequentially. Since you can see on the HBase Master how the request
> increase one after another, not all in the same time.
>
> 2014-09-12 15:23 GMT+02:00 Michael Segel :
>
>> It doesn’t matter which RS, but that you have 1 thread for each region.
>>
>> So for each thread, what’s happening.
>> Step by step, what is the code doing.
>>
>> Now you’re comparing this against a single table scan, right?
>> What’s happening in the table scan…?
>>
>>
>> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz 
>> wrote:
>>
>> > Right, My table for example has keys between 0-9. in three regions
>> > 0-2,3-7,7-9
>> > I lauch three partial scans in parallel. The scans that I'm executing
>> are:
>> > scan(0,2), scan(3,7), scan(7,9).
>> > Each region is if a different RS, so each thread goes to different RS.
>> It's
>> > not exactly like that, but on the benchmark case it's like it's working.
>> >
>> > Really the code will execute a thread for each Region not for each
>> > RegionServer. But in the test I only have two regions for regionServer.
>> I
>> > dont' think that's an important point, there're two threads for RS.
>> >
>> > 2014-09-12 14:48 GMT+02:00 Michael Segel :
>> >
>> >> Ok, lets again take a step back…
>> >>
>> >> So you are comparing your partial scan(s) against a full table scan?
>> >>
>> >> If I understood your question, you launch 3 partial scans where you set
>> >> the start row and then end row of each scan, right?
>> >>
>> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz 
>> wrote:
>> >>
>> >>> Okay, then, the partial scan doesn't work as I think.
>> >>> How could it exceed the limit of a single region if I calculate the
>> >> limits?
>> >>>
>> >>>
>> >>> The only bad point that I see it's that If a region server has three
>> >>> regions of the same table,  I'm executing three partial scans about
>> this
>> >> RS
>> >>> and they could compete for resources (network, etc..) on this node.
>> It'd
>> >> be
>> >>> better to have one thread for RS. But, that doesn't answer your
>> >> questions.
>> >>>
>> >>> I keep thinking...
>> >>>
>> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel :
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> I wanted to take a step back from the actual code and to stop and
>> think
>> >>>> about what you are doing and what HBase is doing under the covers.
>> >>>>
>> >>>> So in your code, you are asking HBase to do 3 separate scans and then
>> >> you
>> >>>> take the result set back and join it.
>> >>>>
>> >>>> What does HBase do when it does a range scan?
>> >>>> What happens when that range scan exceeds a single region?
>> >>>>
>> >>>> If you answer those questions… you’ll have your answer.
>> >>>>
>> >>>> HTH
>> >>>>
>> >>>> -Mike
>> >>>>
>> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz 
>> >> wrote:
>> >>>>
>> >>>>> It's not all the code, I set things like these as well:
>> >>>>> scan.setMaxVersions();
>> >>>>> scan.setCacheBlocks(false);
>> >>>>> ...
>> >>>>>
>> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz :
>> >>>>>
>> >>>>>> yes, that is. I have changed the HBase version to 0.98
>> >>>>>>
>> >>>>>> I got the start and stop keys with this method:
>> >

Re: Scan vs Parallel scan.

2014-09-12 Thread Guillermo Ortiz
For a partial scan, I guess that I call the RS to get the data; it starts
looking in the store files and collecting the data. (It doesn't write to
the block cache in either case.) It has the data ready and gives it to the
client step by step; I mean, it depends on the caching and batching
parameters.

The big differences that I see...
I'm opening more connections to the table, one per region.

I should check the single table scan; it looks like it does partial scans
sequentially, since you can see on the HBase Master how the requests
increase one after another, not all at the same time.

2014-09-12 15:23 GMT+02:00 Michael Segel :

> It doesn’t matter which RS, but that you have 1 thread for each region.
>
> So for each thread, what’s happening.
> Step by step, what is the code doing.
>
> Now you’re comparing this against a single table scan, right?
> What’s happening in the table scan…?
>
>
> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz  wrote:
>
> > Right, My table for example has keys between 0-9. in three regions
> > 0-2,3-7,7-9
> > I lauch three partial scans in parallel. The scans that I'm executing
> are:
> > scan(0,2), scan(3,7), scan(7,9).
> > Each region is if a different RS, so each thread goes to different RS.
> It's
> > not exactly like that, but on the benchmark case it's like it's working.
> >
> > Really the code will execute a thread for each Region not for each
> > RegionServer. But in the test I only have two regions for regionServer. I
> > dont' think that's an important point, there're two threads for RS.
> >
> > 2014-09-12 14:48 GMT+02:00 Michael Segel :
> >
> >> Ok, lets again take a step back…
> >>
> >> So you are comparing your partial scan(s) against a full table scan?
> >>
> >> If I understood your question, you launch 3 partial scans where you set
> >> the start row and then end row of each scan, right?
> >>
> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz 
> wrote:
> >>
> >>> Okay, then, the partial scan doesn't work as I think.
> >>> How could it exceed the limit of a single region if I calculate the
> >> limits?
> >>>
> >>>
> >>> The only bad point that I see it's that If a region server has three
> >>> regions of the same table,  I'm executing three partial scans about
> this
> >> RS
> >>> and they could compete for resources (network, etc..) on this node.
> It'd
> >> be
> >>> better to have one thread for RS. But, that doesn't answer your
> >> questions.
> >>>
> >>> I keep thinking...
> >>>
> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel :
> >>>
> >>>> Hi,
> >>>>
> >>>> I wanted to take a step back from the actual code and to stop and
> think
> >>>> about what you are doing and what HBase is doing under the covers.
> >>>>
> >>>> So in your code, you are asking HBase to do 3 separate scans and then
> >> you
> >>>> take the result set back and join it.
> >>>>
> >>>> What does HBase do when it does a range scan?
> >>>> What happens when that range scan exceeds a single region?
> >>>>
> >>>> If you answer those questions… you’ll have your answer.
> >>>>
> >>>> HTH
> >>>>
> >>>> -Mike
> >>>>
> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz 
> >> wrote:
> >>>>
> >>>>> It's not all the code, I set things like these as well:
> >>>>> scan.setMaxVersions();
> >>>>> scan.setCacheBlocks(false);
> >>>>> ...
> >>>>>
> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz :
> >>>>>
> >>>>>> yes, that is. I have changed the HBase version to 0.98
> >>>>>>
> >>>>>> I got the start and stop keys with this method:
> >>>>>> private List generatePartitions() {
> >>>>>>  List regionScanners = new
> >>>>>> ArrayList();
> >>>>>>  byte[] startKey;
> >>>>>>  byte[] stopKey;
> >>>>>>  HConnection connection = null;
> >>>>>>  HBaseAdmin hbaseAdmin = null;
> >>>>>>  try {
> >>>>>>  connection = HConnectionManager.
> >>

Re: Scan vs Parallel scan.

2014-09-12 Thread Guillermo Ortiz
Right. My table, for example, has keys between 0-9, in three regions:
0-2, 3-7, 7-9.
I launch three partial scans in parallel. The scans that I'm executing are
scan(0,2), scan(3,7), scan(7,9).
Each region is in a different RS, so each thread goes to a different RS. It's
not exactly like that, but in the benchmark case that's how it's working.

Really the code will execute a thread for each region, not for each
RegionServer. But in the test I only have two regions per RegionServer. I
don't think that's an important point; there are two threads per RS.

2014-09-12 14:48 GMT+02:00 Michael Segel :

> Ok, lets again take a step back…
>
> So you are comparing your partial scan(s) against a full table scan?
>
> If I understood your question, you launch 3 partial scans where you set
> the start row and then end row of each scan, right?
>
> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz  wrote:
>
> > Okay, then, the partial scan doesn't work as I think.
> > How could it exceed the limit of a single region if I calculate the
> limits?
> >
> >
> > The only bad point that I see it's that If a region server has three
> > regions of the same table,  I'm executing three partial scans about this
> RS
> > and they could compete for resources (network, etc..) on this node. It'd
> be
> > better to have one thread for RS. But, that doesn't answer your
> questions.
> >
> > I keep thinking...
> >
> > 2014-09-12 9:40 GMT+02:00 Michael Segel :
> >
> >> Hi,
> >>
> >> I wanted to take a step back from the actual code and to stop and think
> >> about what you are doing and what HBase is doing under the covers.
> >>
> >> So in your code, you are asking HBase to do 3 separate scans and then
> you
> >> take the result set back and join it.
> >>
> >> What does HBase do when it does a range scan?
> >> What happens when that range scan exceeds a single region?
> >>
> >> If you answer those questions… you’ll have your answer.
> >>
> >> HTH
> >>
> >> -Mike
> >>
> >> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz 
> wrote:
> >>
> >>> It's not all the code, I set things like these as well:
> >>> scan.setMaxVersions();
> >>> scan.setCacheBlocks(false);
> >>> ...
> >>>
> >>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz :
> >>>
> >>>> yes, that is. I have changed the HBase version to 0.98
> >>>>
> >>>> I got the start and stop keys with this method:
> >>>> private List generatePartitions() {
> >>>>   List regionScanners = new
> >>>> ArrayList();
> >>>>   byte[] startKey;
> >>>>   byte[] stopKey;
> >>>>   HConnection connection = null;
> >>>>   HBaseAdmin hbaseAdmin = null;
> >>>>   try {
> >>>>   connection = HConnectionManager.
> >>>> createConnection(HBaseConfiguration.create());
> >>>>   hbaseAdmin = new HBaseAdmin(connection);
> >>>>   List regions =
> >>>> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>>>   RegionScanner regionScanner = null;
> >>>>   for (HRegionInfo region : regions) {
> >>>>
> >>>>   startKey = region.getStartKey();
> >>>>   stopKey = region.getEndKey();
> >>>>
> >>>>   regionScanner = new RegionScanner(startKey, stopKey,
> >>>> scanConfiguration);
> >>>>   // regionScanner = createRegionScanner(startKey,
> >> stopKey);
> >>>>   if (regionScanner != null) {
> >>>>   regionScanners.add(regionScanner);
> >>>>   }
> >>>>   }
> >>>>
> >>>> And I execute the RegionScanner with this:
> >>>> public List call() throws Exception {
> >>>>   HConnection connection =
> >>>> HConnectionManager.createConnection(HBaseConfiguration.create());
> >>>>   HTableInterface table =
> >>>> connection.getTable(configuration.getTable());
> >>>>
> >>>>   Scan scan = new Scan(startKey, stopKey);
> >>>>   scan.setBatch(configuration.getBatch());
> >>>>   scan.setCaching(configuration.getCaching());
> >>>>   ResultScanner resu

Re: Scan vs Parallel scan.

2014-09-12 Thread Guillermo Ortiz
Okay, then the partial scan doesn't work as I think.
How could it exceed the limit of a single region if I calculate the limits?


The only bad point that I see is that if a region server has three
regions of the same table, I'm executing three partial scans against this RS
and they could compete for resources (network, etc.) on this node. It'd be
better to have one thread per RS. But that doesn't answer your questions.

I keep thinking...

2014-09-12 9:40 GMT+02:00 Michael Segel :

> Hi,
>
> I wanted to take a step back from the actual code and to stop and think
> about what you are doing and what HBase is doing under the covers.
>
> So in your code, you are asking HBase to do 3 separate scans and then you
> take the result set back and join it.
>
> What does HBase do when it does a range scan?
> What happens when that range scan exceeds a single region?
>
> If you answer those questions… you’ll have your answer.
>
> HTH
>
> -Mike
>
> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz  wrote:
>
> > It's not all the code, I set things like these as well:
> > scan.setMaxVersions();
> > scan.setCacheBlocks(false);
> > ...
> >
> > 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz :
> >
> >> yes, that is. I have changed the HBase version to 0.98
> >>
> >> I got the start and stop keys with this method:
> >> private List generatePartitions() {
> >>List regionScanners = new
> >> ArrayList();
> >>byte[] startKey;
> >>byte[] stopKey;
> >>HConnection connection = null;
> >>HBaseAdmin hbaseAdmin = null;
> >>try {
> >>connection = HConnectionManager.
> >> createConnection(HBaseConfiguration.create());
> >>hbaseAdmin = new HBaseAdmin(connection);
> >>List regions =
> >> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>RegionScanner regionScanner = null;
> >>for (HRegionInfo region : regions) {
> >>
> >>startKey = region.getStartKey();
> >>stopKey = region.getEndKey();
> >>
> >>regionScanner = new RegionScanner(startKey, stopKey,
> >> scanConfiguration);
> >>// regionScanner = createRegionScanner(startKey,
> stopKey);
> >>if (regionScanner != null) {
> >>regionScanners.add(regionScanner);
> >>}
> >>}
> >>
> >> And I execute the RegionScanner with this:
> >> public List call() throws Exception {
> >>HConnection connection =
> >> HConnectionManager.createConnection(HBaseConfiguration.create());
> >>HTableInterface table =
> >> connection.getTable(configuration.getTable());
> >>
> >>Scan scan = new Scan(startKey, stopKey);
> >>scan.setBatch(configuration.getBatch());
> >>scan.setCaching(configuration.getCaching());
> >>ResultScanner resultScanner = table.getScanner(scan);
> >>
> >>List results = new ArrayList();
> >>for (Result result : resultScanner) {
> >>results.add(result);
> >>}
> >>
> >>connection.close();
> >>table.close();
> >>
> >>return results;
> >>}
> >>
> >> They implement Callable.
> >>
> >>
> >> 2014-09-12 9:26 GMT+02:00 Michael Segel :
> >>
> >>> Lets take a step back….
> >>>
> >>> Your parallel scan is having the client create N threads where in each
> >>> thread, you’re doing a partial scan of the table where each partial
> scan
> >>> takes the first and last row of each region?
> >>>
> >>> Is that correct?
> >>>
> >>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz 
> >>> wrote:
> >>>
> >>>> I was checking a little bit more about,, I checked the cluster and
> data
> >>> is
> >>>> store in three different regions servers, each one in a differente
> node.
> >>>> So, I guess the threads go to different hard-disks.
> >>>>
> >>>> If someone has an idea or suggestion.. why it's faster a single scan
> >>> than
> >>>> this implementation. I based on this implementation
> >>>> https://github.com/zygm0nt/hbase-distributed-search
> >>>>
> >

Re: Scan vs Parallel scan.

2014-09-12 Thread Guillermo Ortiz
It's not all the code; I set things like these as well:
scan.setMaxVersions();
scan.setCacheBlocks(false);
...

2014-09-12 9:33 GMT+02:00 Guillermo Ortiz :

> yes, that is. I have changed the HBase version to 0.98
>
> I got the start and stop keys with this method:
> private List generatePartitions() {
> List regionScanners = new
> ArrayList();
> byte[] startKey;
> byte[] stopKey;
> HConnection connection = null;
> HBaseAdmin hbaseAdmin = null;
> try {
> connection = HConnectionManager.
> createConnection(HBaseConfiguration.create());
> hbaseAdmin = new HBaseAdmin(connection);
> List regions =
> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> RegionScanner regionScanner = null;
> for (HRegionInfo region : regions) {
>
> startKey = region.getStartKey();
> stopKey = region.getEndKey();
>
> regionScanner = new RegionScanner(startKey, stopKey,
> scanConfiguration);
> // regionScanner = createRegionScanner(startKey, stopKey);
> if (regionScanner != null) {
> regionScanners.add(regionScanner);
> }
> }
>
> And I execute the RegionScanner with this:
> public List call() throws Exception {
> HConnection connection =
> HConnectionManager.createConnection(HBaseConfiguration.create());
> HTableInterface table =
> connection.getTable(configuration.getTable());
>
> Scan scan = new Scan(startKey, stopKey);
> scan.setBatch(configuration.getBatch());
> scan.setCaching(configuration.getCaching());
> ResultScanner resultScanner = table.getScanner(scan);
>
> List results = new ArrayList();
> for (Result result : resultScanner) {
> results.add(result);
> }
>
> connection.close();
> table.close();
>
> return results;
> }
>
> They implement Callable.
>
>
> 2014-09-12 9:26 GMT+02:00 Michael Segel :
>
>> Lets take a step back….
>>
>> Your parallel scan is having the client create N threads where in each
>> thread, you’re doing a partial scan of the table where each partial scan
>> takes the first and last row of each region?
>>
>> Is that correct?
>>
>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz 
>> wrote:
>>
>> > I was checking a little bit more about,, I checked the cluster and data
>> is
>> > store in three different regions servers, each one in a differente node.
>> > So, I guess the threads go to different hard-disks.
>> >
>> > If someone has an idea or suggestion.. why it's faster a single scan
>> than
>> > this implementation. I based on this implementation
>> > https://github.com/zygm0nt/hbase-distributed-search
>> >
>> > 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz :
>> >
>> >> I'm working with HBase 0.94 for this case,, I'll try with 0.98,
>> although
>> >> there is not difference.
>> >> I disabled the table and disabled the blockcache for that family and I
>> put
>> >> scan.setBlockcache(false) as well for both cases.
>> >>
>> >> I think that it's not possible that I executing an complete scan for
>> each
>> >> thread since my data are the type:
>> >> 01 f:q value=1
>> >> 02 f:q value=2
>> >> 03 f:q value=3
>> >> ...
>> >>
>> >> I add all the values and get the same result on a single scan than a
>> >> distributed, so, I guess that DistributedScan did well.
>> >> The count from the hbase shell takes about 10-15seconds, I don't
>> remember,
>> >> but like 4x  of the scan time.
>> >> I'm not using any filter for the scans.
>> >>
>> >> This is the way I calculate number of regions/scans
>> >> private List generatePartitions() {
>> >>List regionScanners = new
>> >> ArrayList();
>> >>byte[] startKey;
>> >>byte[] stopKey;
>> >>HConnection connection = null;
>> >>HBaseAdmin hbaseAdmin = null;
>> >>try {
>> >>connection =
>> >> HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>hbaseAdmin = new HBaseAdmin(connection);
>> >>List regions =
>> >> hbaseAdmin.getTableRegions(scanConfiguration.get

Re: Scan vs Parallel scan.

2014-09-12 Thread Guillermo Ortiz
Yes, that's right. I have changed the HBase version to 0.98.

I got the start and stop keys with this method:
private List<RegionScanner> generatePartitions() {
List<RegionScanner> regionScanners = new
byte[] startKey;
byte[] stopKey;
HConnection connection = null;
HBaseAdmin hbaseAdmin = null;
try {
connection = HConnectionManager.
createConnection(HBaseConfiguration.create());
hbaseAdmin = new HBaseAdmin(connection);
List<HRegionInfo> regions =
hbaseAdmin.getTableRegions(scanConfiguration.getTable());
RegionScanner regionScanner = null;
for (HRegionInfo region : regions) {

startKey = region.getStartKey();
stopKey = region.getEndKey();

regionScanner = new RegionScanner(startKey, stopKey,
scanConfiguration);
// regionScanner = createRegionScanner(startKey, stopKey);
if (regionScanner != null) {
regionScanners.add(regionScanner);
}
}

And I execute the RegionScanner with this:
public List<Result> call() throws Exception {
HConnection connection =
HConnectionManager.createConnection(HBaseConfiguration.create());
HTableInterface table =
connection.getTable(configuration.getTable());

Scan scan = new Scan(startKey, stopKey);
scan.setBatch(configuration.getBatch());
scan.setCaching(configuration.getCaching());
ResultScanner resultScanner = table.getScanner(scan);

List<Result> results = new ArrayList<Result>();
for (Result result : resultScanner) {
results.add(result);
}

connection.close();
table.close();

return results;
}

They implement Callable.
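
One thing that stands out in the code above is that every Callable creates its own HConnection with HConnectionManager.createConnection(); connection setup is relatively expensive and can easily dominate timings like these. Below is a minimal sketch (not the original code) of a driver that creates one shared connection and a fixed thread pool, assuming the RegionScanner Callables and generatePartitions() from this thread are adapted to use the shared connection instead of opening their own:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.Result;

public class ParallelScanDriver {
    public static void main(String[] args) throws Exception {
        // One connection for the whole run, shared by every scan task.
        HConnection connection =
            HConnectionManager.createConnection(HBaseConfiguration.create());
        ExecutorService pool = Executors.newFixedThreadPool(6); // roughly one thread per region

        List<Future<List<Result>>> futures = new ArrayList<Future<List<Result>>>();
        for (RegionScanner task : generatePartitions()) {   // classes from this thread, used as placeholders
            futures.add(pool.submit(task));                  // each task scans one region's key range
        }

        long rows = 0;
        for (Future<List<Result>> f : futures) {
            rows += f.get().size();                          // wait for every partial scan
        }
        System.out.println("total rows: " + rows);

        pool.shutdown();
        connection.close();
    }
}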


2014-09-12 9:26 GMT+02:00 Michael Segel :

> Lets take a step back….
>
> Your parallel scan is having the client create N threads where in each
> thread, you’re doing a partial scan of the table where each partial scan
> takes the first and last row of each region?
>
> Is that correct?
>
> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz  wrote:
>
> > I was checking a little bit more about,, I checked the cluster and data
> is
> > store in three different regions servers, each one in a differente node.
> > So, I guess the threads go to different hard-disks.
> >
> > If someone has an idea or suggestion.. why it's faster a single scan than
> > this implementation. I based on this implementation
> > https://github.com/zygm0nt/hbase-distributed-search
> >
> > 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz :
> >
> >> I'm working with HBase 0.94 for this case,, I'll try with 0.98, although
> >> there is not difference.
> >> I disabled the table and disabled the blockcache for that family and I
> put
> >> scan.setBlockcache(false) as well for both cases.
> >>
> >> I think that it's not possible that I executing an complete scan for
> each
> >> thread since my data are the type:
> >> 01 f:q value=1
> >> 02 f:q value=2
> >> 03 f:q value=3
> >> ...
> >>
> >> I add all the values and get the same result on a single scan than a
> >> distributed, so, I guess that DistributedScan did well.
> >> The count from the hbase shell takes about 10-15seconds, I don't
> remember,
> >> but like 4x  of the scan time.
> >> I'm not using any filter for the scans.
> >>
> >> This is the way I calculate number of regions/scans
> >> private List generatePartitions() {
> >>List regionScanners = new
> >> ArrayList();
> >>byte[] startKey;
> >>byte[] stopKey;
> >>HConnection connection = null;
> >>HBaseAdmin hbaseAdmin = null;
> >>try {
> >>connection =
> >> HConnectionManager.createConnection(HBaseConfiguration.create());
> >>hbaseAdmin = new HBaseAdmin(connection);
> >>List regions =
> >> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>RegionScanner regionScanner = null;
> >>for (HRegionInfo region : regions) {
> >>
> >>startKey = region.getStartKey();
> >>stopKey = region.getEndKey();
> >>
> >>regionScanner = new RegionScanner(startKey, stopKey,
> >> scanConfiguration);
> >>// regionScanner = createRegionScanner(startKey,
> stopKey);
> >>if (regionScanner != null) {
> >>regionScanners.add(regionScanner);
> >>}
> >>}
> >>
>

Re: Scan vs Parallel scan.

2014-09-11 Thread Guillermo Ortiz
I was checking a little bit more: I checked the cluster, and the data is
stored in three different region servers, each one on a different node.
So I guess the threads go to different hard disks.

If someone has an idea or suggestion as to why a single scan is faster than
this implementation, please share. I based it on this implementation:
https://github.com/zygm0nt/hbase-distributed-search

2014-09-11 12:05 GMT+02:00 Guillermo Ortiz :

> I'm working with HBase 0.94 for this case,, I'll try with 0.98, although
> there is not difference.
> I disabled the table and disabled the blockcache for that family and I put
> scan.setBlockcache(false) as well for both cases.
>
> I think that it's not possible that I executing an complete scan for each
> thread since my data are the type:
> 01 f:q value=1
> 02 f:q value=2
> 03 f:q value=3
> ...
>
> I add all the values and get the same result on a single scan than a
> distributed, so, I guess that DistributedScan did well.
> The count from the hbase shell takes about 10-15seconds, I don't remember,
> but like 4x  of the scan time.
> I'm not using any filter for the scans.
>
> This is the way I calculate number of regions/scans
> private List generatePartitions() {
> List regionScanners = new
> ArrayList();
> byte[] startKey;
> byte[] stopKey;
> HConnection connection = null;
> HBaseAdmin hbaseAdmin = null;
> try {
> connection =
> HConnectionManager.createConnection(HBaseConfiguration.create());
> hbaseAdmin = new HBaseAdmin(connection);
> List regions =
> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> RegionScanner regionScanner = null;
> for (HRegionInfo region : regions) {
>
> startKey = region.getStartKey();
> stopKey = region.getEndKey();
>
> regionScanner = new RegionScanner(startKey, stopKey,
> scanConfiguration);
> // regionScanner = createRegionScanner(startKey, stopKey);
> if (regionScanner != null) {
> regionScanners.add(regionScanner);
> }
> }
>
> I did some test for a tiny table and I think that the range for each scan
> works fine. Although, I though that it was interesting that the time when I
> execute distributed scan is about 6x.
>
> I'm going to check about the hard disks, but I think that ti's right.
>
>
>
>
> 2014-09-11 7:50 GMT+02:00 lars hofhansl :
>
>> Which version of HBase?
>> Can you show us the code?
>>
>>
>> Your parallel scan with caching 100 takes about 6x as long as the single
>> scan, which is suspicious because you say you have 6 regions.
>> Are you sure you're not accidentally scanning all the data in each of
>> your parallel scans?
>>
>> -- Lars
>>
>>
>>
>> 
>>  From: Guillermo Ortiz 
>> To: "user@hbase.apache.org" 
>> Sent: Wednesday, September 10, 2014 1:40 AM
>> Subject: Scan vs Parallel scan.
>>
>>
>> Hi,
>>
>> I developed an distributed scan, I create an thread for each region. After
>> that, I've tried to get some times Scan vs DistributedScan.
>> I have disabled blockcache in my table. My cluster has 3 region servers
>> with 2 regions each one, in total there are 100.000 rows and execute a
>> complete scan.
>>
>> My partitions are
>> -01666 -> request 16665
>> 01-02 -> request 1
>> 02-049998 -> request 1
>> 049998-04 -> request 1
>> 04-083330 -> request 1
>> 083330- -> request 16671
>>
>>
>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 10
>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 ->
>> Caching 10
>>
>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 10
>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALJEL:16598ms,Counter:2 ->
>> Caching 100
>>
>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 10
>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 ->
>> Caching 1000
>>
>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 10
>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 ->
>> Caching 1
>>
>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 10
>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 ->
>> Caching 100
>>
>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 10
>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 ->
>> Caching 1000
>>
>> Parallel scan works much worse than simple scan,, and I don't know why
>> it's
>> so fast,, it's really much faster than execute an "count" from hbase
>> shell,
>> what it doesn't look pretty notmal. The only time that it works better
>> parallel is when I execute a normal scan with caching 1.
>>
>> Any clue about it?
>>
>
>


Re: Scan vs Parallel scan.

2014-09-11 Thread Guillermo Ortiz
I'm working with HBase 0.94 for this case; I'll try with 0.98, although I don't
expect any difference.
I disabled the table and disabled the block cache for that family, and I also set
scan.setCacheBlocks(false) for both cases.

I don't think it's possible that I'm executing a complete scan in each
thread, since my data looks like this:
01 f:q value=1
02 f:q value=2
03 f:q value=3
...

I add up all the values and get the same result from the single scan as from the
distributed one, so I guess the DistributedScan did its job correctly.
The count from the hbase shell takes about 10-15 seconds, I don't remember exactly,
but something like 4x the scan time.
I'm not using any filter for the scans.

This is the way I calculate the number of regions/scans:
private List<RegionScanner> generatePartitions() {
List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
byte[] startKey;
byte[] stopKey;
HConnection connection = null;
HBaseAdmin hbaseAdmin = null;
try {
connection =
HConnectionManager.createConnection(HBaseConfiguration.create());
hbaseAdmin = new HBaseAdmin(connection);
List<HRegionInfo> regions =
hbaseAdmin.getTableRegions(scanConfiguration.getTable());
RegionScanner regionScanner = null;
for (HRegionInfo region : regions) {

startKey = region.getStartKey();
stopKey = region.getEndKey();

regionScanner = new RegionScanner(startKey, stopKey,
scanConfiguration);
// regionScanner = createRegionScanner(startKey, stopKey);
if (regionScanner != null) {
regionScanners.add(regionScanner);
}
}

I did some tests on a tiny table and I think the range for each scan
works fine. Still, I found it interesting that the distributed scan takes
about 6x the time.

I'm going to check the hard disks, but I think they're fine.




2014-09-11 7:50 GMT+02:00 lars hofhansl :

> Which version of HBase?
> Can you show us the code?
>
>
> Your parallel scan with caching 100 takes about 6x as long as the single
> scan, which is suspicious because you say you have 6 regions.
> Are you sure you're not accidentally scanning all the data in each of your
> parallel scans?
>
> -- Lars
>
>
>
> 
>  From: Guillermo Ortiz 
> To: "user@hbase.apache.org" 
> Sent: Wednesday, September 10, 2014 1:40 AM
> Subject: Scan vs Parallel scan.
>
>
> Hi,
>
> I developed an distributed scan, I create an thread for each region. After
> that, I've tried to get some times Scan vs DistributedScan.
> I have disabled blockcache in my table. My cluster has 3 region servers
> with 2 regions each one, in total there are 100.000 rows and execute a
> complete scan.
>
> My partitions are
> -01666 -> request 16665
> 01-02 -> request 1
> 02-049998 -> request 1
> 049998-04 -> request 1
> 04-083330 -> request 1
> 083330- -> request 16671
>
>
> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 10
> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 ->
> Caching 10
>
> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 10
> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALJEL:16598ms,Counter:2 ->
> Caching 100
>
> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 10
> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 ->
> Caching 1000
>
> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 10
> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 ->
> Caching 1
>
> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 10
> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 ->
> Caching 100
>
> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 10
> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 ->
> Caching 1000
>
> Parallel scan works much worse than simple scan,, and I don't know why it's
> so fast,, it's really much faster than execute an "count" from hbase shell,
> what it doesn't look pretty notmal. The only time that it works better
> parallel is when I execute a normal scan with caching 1.
>
> Any clue about it?
>


Re: Scan vs Parallel scan.

2014-09-10 Thread Guillermo Ortiz
What I mean is that I don't understand why a count takes more time than a
complete scan without caching. I thought scanning the whole table should take
more time than executing a count.
Another point is why a distributed scan is slower than a sequential scan.
Tomorrow I'll check how many disks we have.

On Wednesday, September 10, 2014, Esteban Gutierrez <
este...@cloudera.com> wrote:

> Hello Guillermo,
>
> Sounds like some potential contention going on, how many disks per node you
> have?
>
> Can you explain further what do you mean by "and I don't know why it's so
> fast,, it's really much faster than execute an "count" from hbase shell,"
> the count command from the shell uses the FirstKeyOnlyFilter and a caching
> of 10 which should be close to the behavior of your testing tool if its
> using the same filter and the same cache settings.
>
> cheers,
> esteban.
>
>
>
>
> --
> Cloudera, Inc.
>
>
> On Wed, Sep 10, 2014 at 1:40 AM, Guillermo Ortiz  >
> wrote:
>
> > Hi,
> >
> > I developed an distributed scan, I create an thread for each region.
> After
> > that, I've tried to get some times Scan vs DistributedScan.
> > I have disabled blockcache in my table. My cluster has 3 region servers
> > with 2 regions each one, in total there are 100.000 rows and execute a
> > complete scan.
> >
> > My partitions are
> > -01666 -> request 16665
> > 01-02 -> request 1
> > 02-049998 -> request 1
> > 049998-04 -> request 1
> > 04-083330 -> request 1
> > 083330- -> request 16671
> >
> >
> > 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 10
> > 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 ->
> > Caching 10
> >
> > 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 10
> > 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALJEL:16598ms,Counter:2 ->
> > Caching 100
> >
> > 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 10
> > 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 ->
> > Caching 1000
> >
> > 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 10
> > 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 ->
> > Caching 1
> >
> > 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 10
> > 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 ->
> > Caching 100
> >
> > 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 10
> > 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 ->
> > Caching 1000
> >
> > Parallel scan works much worse than simple scan,, and I don't know why
> it's
> > so fast,, it's really much faster than execute an "count" from hbase
> shell,
> > what it doesn't look pretty notmal. The only time that it works better
> > parallel is when I execute a normal scan with caching 1.
> >
> > Any clue about it?
> >
>


Scan vs Parallel scan.

2014-09-10 Thread Guillermo Ortiz
Hi,

I developed a distributed scan where I create a thread for each region. After
that, I've tried to compare timings of Scan vs. DistributedScan.
I have disabled the block cache on my table. My cluster has 3 region servers
with 2 regions each; in total there are 100,000 rows, and I execute a
complete scan.

My partitions are
-01666 -> request 16665
01-02 -> request 1
02-049998 -> request 1
049998-04 -> request 1
04-083330 -> request 1
083330- -> request 16671


14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 ->
Caching 10

14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALJEL:16598ms,Counter:2 ->
Caching 100

14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 ->
Caching 1000

14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 ->
Caching 1

14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 ->
Caching 100

14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 10
14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 ->
Caching 1000

Parallel scan works much worse than the simple scan, and I don't know why the
simple scan is so fast; it's really much faster than executing a "count" from the
hbase shell, which doesn't look normal. The only case where the parallel version
works better is against a normal scan with caching 1.

Any clue about it?


How to know regions in a RegionServer?

2014-08-28 Thread Guillermo Ortiz
How can I find out from Java which regions are served by each RegionServer?
I want to execute a parallel scan with one thread per RegionServer,
because I think that's better than one thread per region. Or is it not?
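
A minimal sketch (0.94-era client API; the table name is a placeholder) of one way to group a table's regions by the server that currently hosts them, which would let you start one scan thread per RegionServer:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HTable;

public class RegionsByServer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "table1");
        // region -> server currently hosting it
        NavigableMap<HRegionInfo, ServerName> locations = table.getRegionLocations();
        Map<ServerName, List<HRegionInfo>> byServer = new HashMap<ServerName, List<HRegionInfo>>();
        for (Map.Entry<HRegionInfo, ServerName> e : locations.entrySet()) {
            List<HRegionInfo> list = byServer.get(e.getValue());
            if (list == null) {
                list = new ArrayList<HRegionInfo>();
                byServer.put(e.getValue(), list);
            }
            list.add(e.getKey());
        }
        for (Map.Entry<ServerName, List<HRegionInfo>> e : byServer.entrySet()) {
            System.out.println(e.getKey() + " serves " + e.getValue().size() + " region(s)");
        }
        table.close();
    }
}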


HFile Reader in 0.96 and 0.98

2014-07-24 Thread Guillermo Ortiz
Hi,

I am trying to make my code compatible with both HBase 0.96 and 0.98, but I have
seen that HFile.createReader has a new "Configuration" parameter. I have been
looking for another way to get an HFile.Reader that works in both versions with
the same code, but I think it's not possible. Is there another way to get
an HFile.Reader?

I would like my code to work with both versions so that I could switch
versions quickly.
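
One possible workaround is to pick the right createReader signature by reflection, assuming the only difference between the two versions is the extra trailing Configuration argument described above; this is just a sketch and the exact signatures should be checked against the releases in use:

import java.lang.reflect.Method;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;

public class HFileReaderFactory {
    public static HFile.Reader open(FileSystem fs, Path path, CacheConfig cacheConf,
                                    Configuration conf) throws Exception {
        try {
            // 0.98-style: createReader(FileSystem, Path, CacheConfig, Configuration)
            Method m = HFile.class.getMethod("createReader",
                FileSystem.class, Path.class, CacheConfig.class, Configuration.class);
            return (HFile.Reader) m.invoke(null, fs, path, cacheConf, conf);
        } catch (NoSuchMethodException e) {
            // 0.96-style: createReader(FileSystem, Path, CacheConfig)
            Method m = HFile.class.getMethod("createReader",
                FileSystem.class, Path.class, CacheConfig.class);
            return (HFile.Reader) m.invoke(null, fs, path, cacheConf);
        }
    }
}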


Coprocessor because of TTL expired?

2014-07-21 Thread Guillermo Ortiz
I want to use coprocessors (observers). Could I have a coprocessor
execute my code when a row disappears because its TTL has expired?

Will it be executed automatically? I mean, without any scan or get over
that row? Would it be a preDelete/postDelete hook, or which observer?


Re: Store data in HBase with a MapReduce.

2014-06-27 Thread Guillermo Ortiz
If I do, how many reducers should I have? As many as the number of
regions? I have read about HRegionPartitioner, but it has some
limitations, and you have to be sure that no region is going to split
while you're putting new data into your table. Is it only for performance?
What could happen if you put too much data into your table and a region
splits while using HRegionPartitioner?


2014-06-26 21:43 GMT+02:00 Stack :

> Be sure to read http://hbase.apache.org/book.html#d3314e5975 Guillermo if
> you have not already.  Avoid reduce phase if you can.
>
> St.Ack
>
>
> On Thu, Jun 26, 2014 at 8:24 AM, Guillermo Ortiz 
> wrote:
>
> > I have a question.
> > I want to execute an MapReduce and the output of my reduce it's going to
> > store in HBase.
> >
> > So, it's a MapReduce with an output which it's going to be stored in
> HBase.
> > I can do a Map and use HFileOutputFormat.configureIncrementalLoad(pJob,
> > table); but, I don't know how I could do it if I have a Reduce as well,,
> > since the configureIncrementalLoad generates an reduce.
> >
>


Store data in HBase with a MapReduce.

2014-06-26 Thread Guillermo Ortiz
I have a question.
I want to execute a MapReduce job whose reduce output is going to be
stored in HBase.

So, it's a MapReduce job with an output that is going to be stored in HBase.
For a map-only job I can use HFileOutputFormat.configureIncrementalLoad(pJob,
table); but I don't know how I could do it if I have a reduce as well,
since configureIncrementalLoad sets up its own reduce.
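
One possible approach, sketched below, is to split the work into two jobs: the first runs your own map and reduce logic, and the second is a map-only bulk-load job whose mapper emits Puts, where HFileOutputFormat.configureIncrementalLoad() installs the total-order partitioner and its own sorting reducer. All class names and paths here (MyMapper, MyAggregationReducer, BulkLoadMapper, the input/intermediate/hfiles directories, "mytable") are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStepBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Job 1: your own map + reduce, writing an intermediate directory.
        Job aggregate = new Job(conf, "aggregate");
        aggregate.setJarByClass(TwoStepBulkLoad.class);
        aggregate.setMapperClass(MyMapper.class);               // placeholder
        aggregate.setReducerClass(MyAggregationReducer.class);  // placeholder: the real reduce logic
        FileInputFormat.addInputPath(aggregate, new Path("/input"));
        FileOutputFormat.setOutputPath(aggregate, new Path("/intermediate"));
        aggregate.waitForCompletion(true);

        // Job 2: map-only bulk load; the mapper emits <ImmutableBytesWritable, Put>.
        Job bulkLoad = new Job(conf, "bulk-load");
        bulkLoad.setJarByClass(TwoStepBulkLoad.class);
        bulkLoad.setMapperClass(BulkLoadMapper.class);          // placeholder
        bulkLoad.setMapOutputKeyClass(ImmutableBytesWritable.class);
        bulkLoad.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(bulkLoad, new Path("/intermediate"));
        FileOutputFormat.setOutputPath(bulkLoad, new Path("/hfiles"));
        HTable table = new HTable(conf, "mytable");
        // Installs the partitioner and a sorting reducer; don't set a reducer yourself.
        HFileOutputFormat.configureIncrementalLoad(bulkLoad, table);
        bulkLoad.waitForCompletion(true);
    }
}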


Re: Does compression ever improve performance?

2014-06-14 Thread Guillermo Ortiz
I would like to see the times they got doing some scans or gets in the
benchmark about compression and block encoding, to figure out how much time
you save when your data is smaller but you have to decompress it.
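
For reference, a short sketch (0.94-era API; the table and family names are placeholders) of what turning Snappy on for an existing family looks like; existing store files only pick up the new compression when they are rewritten by compaction:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class EnableSnappy {
    public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        admin.disableTable("mytable");
        HColumnDescriptor cf = new HColumnDescriptor("f");
        cf.setCompressionType(Compression.Algorithm.SNAPPY);
        admin.modifyColumn("mytable", cf);   // apply the new descriptor to the family
        admin.enableTable("mytable");
        admin.close();
    }
}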

On Saturday, June 14, 2014, Kevin O'dell 
wrote:

> Hi Jeremy,
>
>   I always recommend turning on snappy compression,  I have ~20%
> performance increases.
> On Jun 14, 2014 10:25 AM, "Ted Yu" >
> wrote:
>
> > You may have read Doug Meil's writeup where he tried out different
> > ColumnFamily
> > compressions :
> >
> > https://blogs.apache.org/hbase/
> >
> > Cheers
> >
> >
> > On Fri, Jun 13, 2014 at 11:33 AM, jeremy p <
> athomewithagroove...@gmail.com 
> > >
> > wrote:
> >
> > > Thank you -- I'll go ahead and try compression.
> > >
> > > --Jeremy
> > >
> > >
> > > On Fri, Jun 13, 2014 at 10:59 AM, Dima Spivak  >
> > > wrote:
> > >
> > > > I'd highly recommend it. In general, compressing your column families
> > > will
> > > > improve performance by reducing the resources required to get data
> from
> > > > disk (even when taking into account the CPU overhead of compressing
> and
> > > > decompressing).
> > > >
> > > > -Dima
> > > >
> > > >
> > > > On Fri, Jun 13, 2014 at 10:35 AM, jeremy p <
> > > athomewithagroove...@gmail.com 
> > > > >
> > > > wrote:
> > > >
> > > > > Hey all,
> > > > >
> > > > > Right now, I'm not using compression on any of my tables, because
> our
> > > > data
> > > > > doesn't take up a huge amount of space.  However, I would turn on
> > > > > compression if there was a chance it would improve HBase's
> > performance.
> > > >  By
> > > > > performance, I'm talking about the speed with which HBase responds
> to
> > > > > requests and retrieves data.
> > > > >
> > > > > Should I turn compression on?
> > > > >
> > > > > --Jeremy
> > > > >
> > > >
> > >
> >
>


Delete rowKey in hexadecimal array bytes.

2014-06-09 Thread Guillermo Ortiz
Hi,

I'm generating keys with SHA-1. Since the result is a hex representation, I
decode it with Hex.decode to save memory, since I can store the keys in half the
space.

I have a MapReduce process which deletes some of these keys; the problem is
that when I try to delete them, the delete has no effect. If I
don't do the hex decoding, it works.

So, for example, putting the key as the SHA-1 string
b343664e210e7a7abff3625a005e65e2b0d4616 works, but if I decode this key with
Hex.decode to *\xB3CfN!\x0Ezz\xBF\xF3bZ\x00^e\xE2\xB0\ *column=l:dd,
timestamp=1384317115000  it doesn't.

I have checked the code a lot and I think it's right; plus, if I
comment out the hex decoding, it works.

Any clue about it? Is there any problem with what I am trying to do?
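
A small sketch of keeping the transformation symmetric (assuming org.apache.commons.codec.binary.Hex, since Hex.decode is mentioned above): whatever bytes were used as the row when writing must be reproduced exactly when deleting. Note also that decodeHex() only accepts an even number of hex characters, and a SHA-1 string should be 40 of them; the key quoted above shows 39, which may just be a copy/paste truncation, but it is worth checking.

import java.security.MessageDigest;
import org.apache.commons.codec.binary.Hex;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;

public class HexKeys {
    // Binary 20-byte row key straight from SHA-1, skipping the hex string entirely.
    static byte[] rowKey(String id) throws Exception {
        return MessageDigest.getInstance("SHA-1").digest(id.getBytes("UTF-8"));
    }

    // If the rows were written as Hex.decodeHex(hexKey), the delete must use
    // exactly the same decoding, or it targets a different row.
    static Put putFor(String hexKey) throws Exception {
        return new Put(Hex.decodeHex(hexKey.toCharArray()));
    }

    static Delete deleteFor(String hexKey) throws Exception {
        return new Delete(Hex.decodeHex(hexKey.toCharArray()));
    }
}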


Parallel Scan with TableMapReduceUtil

2014-05-15 Thread Guillermo Ortiz
I am processing data from HBase with a MapReduce. The input of my MapReduce
is a "full" scan of a table.

When I execute a full scan with TableMapReduceUtil, is the scan executed
in parallel, so that all mappers read their data in parallel, the same way as
if I executed many range scans with threads?
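
For reference, a minimal sketch of a full-table scan used as MapReduce input (0.94-era API; the table name and MyTableMapper are placeholders). TableInputFormat creates one input split per region, so each mapper scans its own region and the parallelism comes from MapReduce scheduling mappers, not from client-side threads:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FullScanJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "full-scan");
        job.setJarByClass(FullScanJob.class);
        Scan scan = new Scan();
        scan.setCaching(500);        // more rows per RPC
        scan.setCacheBlocks(false);  // don't pollute the block cache from MapReduce
        TableMapReduceUtil.initTableMapperJob(
            "table1", scan, MyTableMapper.class,   // placeholder TableMapper subclass
            ImmutableBytesWritable.class, Text.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.waitForCompletion(true);
    }
}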


Re: Error loading SHA-1 keys with load bulk

2014-05-06 Thread Guillermo Ortiz
The error was that when I was emitting the <key, value> pair, I was computing the
SHA over the emitted key K, but not over the row key inside the Value.
The Value is a KeyValue, and that's where I also had to apply the SHA-1.
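
A sketch of what the corrected mapper might look like; the input layout (a comma-separated id and value) and the family/qualifier names are made up, and the only point that matters is using the same row bytes in both the shuffle key and the KeyValue:

import java.io.IOException;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EventMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");   // assumed layout: id,value
        byte[] row = Bytes.toBytes(DigestUtils.sha1Hex(parts[0]));
        // Same row bytes in the emitted key and inside the KeyValue:
        ctx.write(new ImmutableBytesWritable(row),
                  new KeyValue(row, Bytes.toBytes("f"), Bytes.toBytes("q"),
                               Bytes.toBytes(parts[1])));
    }
}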


2014-05-02 0:42 GMT+02:00 Guillermo Ortiz :

> Yes, I do,
>
>
> job.setMapperClass(EventMapper.class);
> job.setMapOutputKeyClass(ImmutableBytesWritable.class);
> job.setMapOutputValueClass(KeyValue.class);
>
> FileOutputFormat.setOutputPath(job, hbasePath);
> HTable table = new HTable(jConf, MEM_TABLE_HBASE);
> HFileOutputFormat.configureIncrementalLoad(job, table);
>
>
> The error is happeing in a MRUnit, I don't know if it changes something
> about the behavior, because I had some troubles in the past for the same
> reason about the serialization in Hbase 0.96 and MRUnit.
> . Besides, in the setup of the MRUnit test I load some data in hbase with
> keys in sha1 and it works.
>
> El jueves, 1 de mayo de 2014, Jean-Daniel Cryans 
> escribió:
>
> Are you using HFileOutputFormat.configureIncrementalLoad() to set up the
>> partitioner and the reducers? That will take care of ordering your keys.
>>
>> J-D
>>
>>
>> On Thu, May 1, 2014 at 5:38 AM, Guillermo Ortiz > >wrote:
>>
>> > I have been looking at the code in HBase, but, I don't really understand
>> > what this error happens. Why can I put in HBase those keys?
>> >
>> >
>> > 2014-04-30 17:57 GMT+02:00 Guillermo Ortiz
>> > > > ');>
>> > >:
>> >
>> > > I'm using HBase with MapReduce to load a lot of data, so I have
>> decide to
>> > > do it with bulk load.
>> > >
>> > >
>> > > I parse my keys with SHA1, but when I try to load them, I got this
>> > > exception.
>> > >
>> > > java.io.IOException: Added a key not lexically larger than previous
>> >
>> key=\x00(6e9e59f36a7ec2ac54635b2d353e53e677839046\x01l\x00\x00\x01E\xB3>\xC9\xC7\x0E,
>> >
>> lastkey=\x00(b313a9f1f57c8a07c81dc3221c6151cf3637506a\x01l\x00\x00\x01E\xAE\x18k\x87\x0E
>> > >   at
>> >
>> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:207)
>> > >   at
>> >
>> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:324)
>> > >   at
>> >
>> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:289)
>> > >   at
>> >
>> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1206)
>> > >   at
>> >
>> org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168)
>> > >   at
>> >
>> org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
>> > >   at
>> >
>> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
>> > >   at
>> >
>> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
>> > >
>> > > I work with HBase 0.94.6. I have been loking for if I could define any
>> > reducer, since, I have defined no one. I have read something about
>> > KeyValueSortReducer but, I don'tknow if there's something that extends
>> > TableReducer or I'm lookging for a wrong way.
>> > >
>> > >
>> > >
>> >
>>
>


Re: Error loading SHA-1 keys with load bulk

2014-05-01 Thread Guillermo Ortiz
Yes, I do,


job.setMapperClass(EventMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);

FileOutputFormat.setOutputPath(job, hbasePath);
HTable table = new HTable(jConf, MEM_TABLE_HBASE);
HFileOutputFormat.configureIncrementalLoad(job, table);


The error is happening in an MRUnit test; I don't know if that changes the
behavior, because I had some trouble in the past for a similar
reason with serialization in HBase 0.96 and MRUnit.
Besides, in the setup of the MRUnit test I load some data into HBase with
SHA-1 keys and it works.

On Thursday, May 1, 2014, Jean-Daniel Cryans 
wrote:

> Are you using HFileOutputFormat.configureIncrementalLoad() to set up the
> partitioner and the reducers? That will take care of ordering your keys.
>
> J-D
>
>
> On Thu, May 1, 2014 at 5:38 AM, Guillermo Ortiz 
> 
> >wrote:
>
> > I have been looking at the code in HBase, but, I don't really understand
> > what this error happens. Why can I put in HBase those keys?
> >
> >
> > 2014-04-30 17:57 GMT+02:00 Guillermo Ortiz
> >  konstt2...@gmail.com 
> > ');>
> > >:
> >
> > > I'm using HBase with MapReduce to load a lot of data, so I have decide
> to
> > > do it with bulk load.
> > >
> > >
> > > I parse my keys with SHA1, but when I try to load them, I got this
> > > exception.
> > >
> > > java.io.IOException: Added a key not lexically larger than previous
> >
> key=\x00(6e9e59f36a7ec2ac54635b2d353e53e677839046\x01l\x00\x00\x01E\xB3>\xC9\xC7\x0E,
> >
> lastkey=\x00(b313a9f1f57c8a07c81dc3221c6151cf3637506a\x01l\x00\x00\x01E\xAE\x18k\x87\x0E
> > >   at
> >
> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:207)
> > >   at
> >
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:324)
> > >   at
> >
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:289)
> > >   at
> >
> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1206)
> > >   at
> >
> org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168)
> > >   at
> >
> org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
> > >   at
> >
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
> > >   at
> >
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
> > >
> > > I work with HBase 0.94.6. I have been loking for if I could define any
> > reducer, since, I have defined no one. I have read something about
> > KeyValueSortReducer but, I don'tknow if there's something that extends
> > TableReducer or I'm lookging for a wrong way.
> > >
> > >
> > >
> >
>


Error loading SHA-1 keys with load bulk

2014-05-01 Thread Guillermo Ortiz
I have been looking at the HBase code, but I don't really understand
why this error happens. Why can't I put those keys into HBase?


2014-04-30 17:57 GMT+02:00 Guillermo Ortiz

>:

> I'm using HBase with MapReduce to load a lot of data, so I have decide to
> do it with bulk load.
>
>
> I parse my keys with SHA1, but when I try to load them, I got this
> exception.
>
> java.io.IOException: Added a key not lexically larger than previous 
> key=\x00(6e9e59f36a7ec2ac54635b2d353e53e677839046\x01l\x00\x00\x01E\xB3>\xC9\xC7\x0E,
>  
> lastkey=\x00(b313a9f1f57c8a07c81dc3221c6151cf3637506a\x01l\x00\x00\x01E\xAE\x18k\x87\x0E
>   at 
> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:207)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:324)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:289)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1206)
>   at 
> org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168)
>   at 
> org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
>   at 
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
>   at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
>
> I work with HBase 0.94.6. I have been loking for if I could define any 
> reducer, since, I have defined no one. I have read something about 
> KeyValueSortReducer but, I don'tknow if there's something that extends 
> TableReducer or I'm lookging for a wrong way.
>
>
>


Error loading SHA-1 keys with load bulk

2014-04-30 Thread Guillermo Ortiz
I'm using HBase with MapReduce to load a lot of data, so I have decided to
do it with a bulk load.


I generate my keys with SHA-1, but when I try to load them, I get this
exception.

java.io.IOException: Added a key not lexically larger than previous
key=\x00(6e9e59f36a7ec2ac54635b2d353e53e677839046\x01l\x00\x00\x01E\xB3>\xC9\xC7\x0E,
lastkey=\x00(b313a9f1f57c8a07c81dc3221c6151cf3637506a\x01l\x00\x00\x01E\xAE\x18k\x87\x0E
at 
org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:207)
at 
org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:324)
at 
org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:289)
at 
org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1206)
at 
org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168)
at 
org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
at 
org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
at 
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)

I work with HBase 0.94.6. I have been looking into whether I should define a
reducer, since I haven't defined one. I have read something about
KeyValueSortReducer, but I don't know whether there's something that extends
TableReducer or whether I'm looking in the wrong direction.


Re: Weird behavior splitting regions

2014-04-15 Thread Guillermo Ortiz
I read the article; that's why I asked the question, because I didn't
understand the result I got.

Oh, yes!! That's true, so silly of me.
I think some of the files are pretty small because the table has two
families and one of them is much smaller than the other one. So it has
been split many times. The big regions reach a size close to 1 GB, but the
smaller regions end up pretty small because they have been
split a lot of times.

What I don't know is why HBase decides to split the table so late: not
when I create the table pre-split, but two hours later or whenever.
Anyway, that's my mistake; I'm just curious about it.


2014-04-15 12:17 GMT+02:00 divye sheth :

> The default split policy in hbase0.94.x is IncreaseToUpperBound rather than
> ConstantSizeSplitPolicy which was the default in the older versions of
> hbase.
>
> Please refer to the link given below to understand how a
> IncreaseToUpperBoundSplitPolicy works:
> http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
> check the auto-splitting section
>
> Hope this answers your question
>
> Thanks
> Divye Sheth
>
>
>
> On Tue, Apr 15, 2014 at 3:36 PM, Bharath Vissapragada <
> bhara...@cloudera.com
> > wrote:
>
> > >There're some new regions that they're just a some KBytes!. Why they are
> > so
> > small?? When does HBase decide to split? because it started to split two
> > hours later to create the table.
> >
> > When hbase does a split, it doesn't actually split at the disk/file
> level.
> > Its just a metadata operation which creates new regions that contain the
> > reference files that still point to old HFiles. That is the reason you
> find
> > KB size regions.
> >
> > >I thought major compaction just happen once at day and compact many
> files
> > per region. Data is always the same here, I don't inject new data.
> >
> > IIRC sometimes minor compactions get promoted to major compactions based
> on
> > some criteria, but I'll leave it for others to answer!
> >
> >
> >
> > On Tue, Apr 15, 2014 at 3:15 PM, Guillermo Ortiz  > >wrote:
> >
> > > I have a table in Hbase that sizes around 96Gb,
> > >
> > > I generate 4 regions of 30Gb. Some time, table starts to split because
> > the
> > > max size for region is 1Gb (I just realize of that, I'm going to change
> > it
> > > or create more pre-splits.).
> > >
> > > There're two things that I don't understand. how is it creating the
> > splits?
> > > right now I have 130 regions and growing. The problem is the size of
> the
> > > new regions:
> > >
> > > 1.7 M/hbase/filters/4ddbc34a2242e44c03121ae4608788a2
> > > 1.6 G/hbase/filters/548bdcec79cfe9a99fa57cb18f801be2
> > > 3.1 G/hbase/filters/58b50df089bd9d4d1f079f53238e060d
> > > 2.5 M/hbase/filters/5a0d6d5b3b8faf67889ac5f5c2947c4f
> > > 1.9 G/hbase/filters/5b0a35b5735a473b7e804c4b045ce374
> > > 883.4 M  /hbase/filters/5b49c68e305b90d87b3c64a0eee60b8c
> > > 1.7 M/hbase/filters/5d43fd7ea9808ab7d2f2134e80fbfae7
> > > 632.4 M  /hbase/filters/5f04c7cd450d144f88fb4c7cff0796a2
> > >
> > > There're some new regions that they're just a some KBytes!. Why they
> are
> > so
> > > small?? When does HBase decide to split? because it started to split
> two
> > > hours later to create the table.
> > >
> > > One, I create the table and insert data, I don't insert new data or
> > modify
> > > them.
> > >
> > >
> > > Another interested point it's why there're major compactions:
> > > 2014-04-15 11:33:47,400 INFO
> org.apache.hadoop.hbase.regionserver.Store:
> > > Renaming compacted file at
> > >
> > >
> >
> hdfs://m01.cluster:8020/hbase/filters/ef994715505054299ede8c48c600cea4/.tmp/df90c260cb4e4256a153dd178244f04c
> > > to
> > >
> > >
> >
> hdfs://m01.cluster:8020/hbase/filters/ef994715505054299ede8c48c600cea4/d/df90c260cb4e4256a153dd178244f04c
> > > 2014-04-15 11:33:47,407 INFO
> > > org.apache.hadoop.hbase.regionserver.StoreFile$Reader: Loaded ROWCOL
> > > (CompoundBloomFilter) metadata for df90c260cb4e4256a153dd178244f04c
> > > 2014-04-15 11:33:47,416 INFO
> org.apache.hadoop.hbase.regionserver.Store:*
> > > Completed major compaction of 1 file*(s) in d of
> > > filters,51,1397554175140.ef994715505054299ede8c48c600cea4. into
> > > df90c260cb4e4256a153dd178244f04c, size=789.

Weird behavior splitting regions

2014-04-15 Thread Guillermo Ortiz
I have a table in HBase that is around 96 GB in size.

I generated 4 regions of 30 GB each. At some point, the table starts to split
because the max size for a region is 1 GB (I just realized that; I'm going to
change it or create more pre-splits).

There are two things that I don't understand. How is it creating the splits?
Right now I have 130 regions and growing. The problem is the size of the
new regions:

1.7 M/hbase/filters/4ddbc34a2242e44c03121ae4608788a2
1.6 G/hbase/filters/548bdcec79cfe9a99fa57cb18f801be2
3.1 G/hbase/filters/58b50df089bd9d4d1f079f53238e060d
2.5 M/hbase/filters/5a0d6d5b3b8faf67889ac5f5c2947c4f
1.9 G/hbase/filters/5b0a35b5735a473b7e804c4b045ce374
883.4 M  /hbase/filters/5b49c68e305b90d87b3c64a0eee60b8c
1.7 M/hbase/filters/5d43fd7ea9808ab7d2f2134e80fbfae7
632.4 M  /hbase/filters/5f04c7cd450d144f88fb4c7cff0796a2

Some of the new regions are just a few KBytes! Why are they so
small? And when does HBase decide to split? Because it only started to split two
hours after I created the table.

Once I create the table and insert the data, I don't insert new data or modify
it.


Another interesting point is why there are major compactions:
2014-04-15 11:33:47,400 INFO org.apache.hadoop.hbase.regionserver.Store:
Renaming compacted file at
hdfs://m01.cluster:8020/hbase/filters/ef994715505054299ede8c48c600cea4/.tmp/df90c260cb4e4256a153dd178244f04c
to
hdfs://m01.cluster:8020/hbase/filters/ef994715505054299ede8c48c600cea4/d/df90c260cb4e4256a153dd178244f04c
2014-04-15 11:33:47,407 INFO
org.apache.hadoop.hbase.regionserver.StoreFile$Reader: Loaded ROWCOL
(CompoundBloomFilter) metadata for df90c260cb4e4256a153dd178244f04c
2014-04-15 11:33:47,416 INFO org.apache.hadoop.hbase.regionserver.Store:*
Completed major compaction of 1 file*(s) in d of
filters,51,1397554175140.ef994715505054299ede8c48c600cea4. into
df90c260cb4e4256a153dd178244f04c, size=789.1 M; total size for store is
789.1 M
2014-04-15 11:33:47,416 INFO
org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest:
completed compaction:
regionName=filters,51,1397554175140.ef994715505054299ede8c48c600cea4.,
storeName=d, fileCount=1, fileSize=1.5 G, priority=6, time=414761474510060;
duration=7sec

I thought major compactions only happen once a day and compact many files
per region. The data is always the same here; I don't inject new data.


I'm working with 0.94.6 on CDH 4.4. I'm going to change the size of the regions,
but I would like to understand why these things happen.

Thank you.
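
A sketch (0.94-era API; the split points and the 10 GB limit are arbitrary examples) of creating the table pre-split and with a larger per-region max file size, so the split points are decided up front instead of appearing later once the configured 1 GB limit is crossed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePresplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("filters");
        desc.addFamily(new HColumnDescriptor("d"));
        desc.setMaxFileSize(10L * 1024 * 1024 * 1024);  // let a region grow to ~10 GB before splitting
        byte[][] splits = new byte[][] {                // 4 regions; adjust to the real key space
            Bytes.toBytes("25"), Bytes.toBytes("50"), Bytes.toBytes("75")
        };
        admin.createTable(desc, splits);
        admin.close();
    }
}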


Re: How to generate a large dataset quickly.

2014-04-14 Thread Guillermo Ortiz
But if I'm using bulk load, I think this method already bypasses the WAL, right?
I have no idea about autoFlush: is it still necessary to set it to false,
or does the bulk load do some kind of magic with that as well?

I could try to do the loads without bulk load, but I don't think that's the
problem; maybe it's just the time the cluster needs, although it seems
like too much time.



2014-04-14 22:51 GMT+02:00 lars hofhansl :

> +1 to what Vladimir said.
> For the Puts in question you can also disable the write ahead log (WAL)
> and issue a flush on the table after your ingest.
>
> -- Lars
>
>
> - Original Message -
> From: Vladimir Rodionov 
> To: "user@hbase.apache.org" 
> Cc:
> Sent: Monday, April 14, 2014 11:15 AM
> Subject: RE: How to generate a large dataset quickly.
>
> There is no need to run M/R unless your cluster is large (very large)
> Single multithreaded client can easily ingest 10s of thousands rows per
> sec.
> Check YCSB benchmark tool, for example.
>
> Make sure you disable both region splitting and major compaction during
> data ingestion
> and pre-split regions accordingly to improve overall performance.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodio...@carrieriq.com
>
> 
>
> From: Ted Yu [yuzhih...@gmail.com]
> Sent: Monday, April 14, 2014 9:16 AM
> To: user@hbase.apache.org
> Subject: Re: How to generate a large dataset quickly.
>
> I looked at revision history for HFileOutputFormat.java
> There was one patch, HBASE-8949, which went into 0.94.11 but it shouldn't
> affect throughput much.
>
> If you can use ganglia (or some similar tool) to pinpoint what caused the
> low ingest rate, that would give us more clue.
>
> BTW Is upgrading to newer release, such as 0.98.1 (which contains
> HBASE-8755), an option for you ?
>
> Cheers
>
>
> On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz  >wrote:
>
> > I'm using. 0.94.6-cdh4.4.0,
> >
> > I use the bulkload:
> > FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> > FileOutputFormat.setOutputPath(job, hbasePath);
> > HTable table = new HTable(jConf, HBASE_TABLE);
> > HFileOutputFormat.configureIncrementalLoad(job, table);
> >
> > It seems that it takes really long time when it starts to execute the
> Puts
> > to HBase in the reduce phase.
> >
> >
> >
> > 2014-04-14 14:35 GMT+02:00 Ted Yu :
> >
> > > Which hbase release did you run mapreduce job ?
> > >
> > > Cheers
> > >
> > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz 
> > wrote:
> > >
> > > > I want to create a large dateset for HBase with different versions
> and
> > > > number of rows. It's about 10M rows and 100 versions to do some
> > > benchmarks.
> > > >
> > > > What's the fastest way to create it?? I'm generating the dataset
> with a
> > > > Mapreduce of 100.000rows and 10verions. It takes 17minutes and size
> > > around
> > > > 7Gb. I don't know if I could do it quickly. The bottleneck is when
> > > > MapReduces write the output and when transfer the output to the
> > Reduces.
> > >
> >
>
> Confidentiality Notice:  The information contained in this message,
> including any attachments hereto, may be confidential and is intended to be
> read only by the individual or entity to whom this message is addressed. If
> the reader of this message is not the intended recipient or an agent or
> designee of the intended recipient, please note that any review, use,
> disclosure or distribution of this message or its attachments, in any form,
> is strictly prohibited.  If you have received this message in error, please
> immediately notify the sender and/or notificati...@carrieriq.com and
> delete or destroy any copy of this message and its attachments.
>


Re: How to generate a large dataset quickly.

2014-04-14 Thread Guillermo Ortiz
Are there any benchmarks about how long it should take to insert data into
HBase, to have a reference?
The output of my mapper has 3.2 million records, so I execute 3.2 million Puts
into HBase.

Well, the data has to be copied and sent to the reducers, but with a 1 Gb
network it shouldn't take too much time. I'll check Ganglia.


2014-04-14 18:16 GMT+02:00 Ted Yu :

> I looked at revision history for HFileOutputFormat.java
> There was one patch, HBASE-8949, which went into 0.94.11 but it shouldn't
> affect throughput much.
>
> If you can use ganglia (or some similar tool) to pinpoint what caused the
> low ingest rate, that would give us more clue.
>
> BTW Is upgrading to newer release, such as 0.98.1 (which contains
> HBASE-8755), an option for you ?
>
> Cheers
>
>
> On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz  >wrote:
>
> > I'm using. 0.94.6-cdh4.4.0,
> >
> > I use the bulkload:
> > FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> > FileOutputFormat.setOutputPath(job, hbasePath);
> > HTable table = new HTable(jConf, HBASE_TABLE);
> > HFileOutputFormat.configureIncrementalLoad(job, table);
> >
> > It seems that it takes really long time when it starts to execute the
> Puts
> > to HBase in the reduce phase.
> >
> >
> >
> > 2014-04-14 14:35 GMT+02:00 Ted Yu :
> >
> > > Which hbase release did you run mapreduce job ?
> > >
> > > Cheers
> > >
> > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz 
> > wrote:
> > >
> > > > I want to create a large dateset for HBase with different versions
> and
> > > > number of rows. It's about 10M rows and 100 versions to do some
> > > benchmarks.
> > > >
> > > > What's the fastest way to create it?? I'm generating the dataset
> with a
> > > > Mapreduce of 100.000rows and 10verions. It takes 17minutes and size
> > > around
> > > > 7Gb. I don't know if I could do it quickly. The bottleneck is when
> > > > MapReduces write the output and when transfer the output to the
> > Reduces.
> > >
> >
>


Re: How to generate a large dataset quickly.

2014-04-14 Thread Guillermo Ortiz
I'm using 0.94.6-cdh4.4.0.

I use the bulk load:
FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
FileOutputFormat.setOutputPath(job, hbasePath);
HTable table = new HTable(jConf, HBASE_TABLE);
HFileOutputFormat.configureIncrementalLoad(job, table);

It seems to take a really long time once it starts executing the Puts
against HBase in the reduce phase.



2014-04-14 14:35 GMT+02:00 Ted Yu :

> Which hbase release did you run mapreduce job ?
>
> Cheers
>
> On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz  wrote:
>
> > I want to create a large dateset for HBase with different versions and
> > number of rows. It's about 10M rows and 100 versions to do some
> benchmarks.
> >
> > What's the fastest way to create it?? I'm generating the dataset with a
> > Mapreduce of 100.000rows and 10verions. It takes 17minutes and size
> around
> > 7Gb. I don't know if I could do it quickly. The bottleneck is when
> > MapReduces write the output and when transfer the output to the Reduces.
>


How to generate a large dataset quickly.

2014-04-14 Thread Guillermo Ortiz
I want to create a large dataset for HBase with different numbers of versions
and rows. It's about 10M rows and 100 versions, to do some benchmarks.

What's the fastest way to create it? I'm generating the dataset with a
MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes and the size is
around 7 GB. I don't know if I could do it more quickly. The bottleneck is when
the mappers write their output and when the output is transferred to the reducers.
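
A sketch of the plain multithreaded-client alternative suggested elsewhere in this thread (0.94-era API; the table/family names, thread count and row ranges are arbitrary): autoflush off, the WAL skipped, and explicit timestamps so each cell gets several versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedLoader implements Runnable {
    private final int start, end;
    VersionedLoader(int start, int end) { this.start = start; this.end = end; }

    public void run() {
        try {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "bench");
            table.setAutoFlush(false);                        // batch puts client-side
            for (int row = start; row < end; row++) {
                for (int v = 0; v < 100; v++) {               // 100 versions per cell
                    Put p = new Put(Bytes.toBytes(String.format("%010d", row)));
                    p.setWriteToWAL(false);                   // synthetic load: skip the WAL
                    p.add(Bytes.toBytes("f"), Bytes.toBytes("q"),
                          1000L + v, Bytes.toBytes("v" + v)); // explicit timestamp = version
                    table.put(p);
                }
            }
            table.flushCommits();
            table.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        int rowsPerThread = 1250000;                          // 8 threads x 1.25M rows = 10M rows
        for (int t = 0; t < 8; t++) {
            new Thread(new VersionedLoader(t * rowsPerThread, (t + 1) * rowsPerThread)).start();
        }
    }
}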


Re: Lease exception when I execute large scan with filters.

2014-04-12 Thread Guillermo Ortiz
>>> Another little question is, when the filter I'm using, Do I check
> all the
> >>>>>> versions? or just the newest? Because, I'm wondering if when I do a
> scan
> >>>>>> over all the table, I look for the value "5" in all the dataset or
> I'm
> >>>>>> just
> >>>>>> looking for in one newest version of each value.
> >>>>>>
> >>>>>>
> >>>>>> On 10/04/14 16:52, gortiz wrote:
> >>>>>>
> >>>>>> I was trying to check the behaviour of HBase. The cluster is a
> group of
> >>>>>>> old computers, one master, five slaves, each one with 2Gb, so,
> 12gb in
> >>>>>>> total.
> >>>>>>> The table has a column family with 1000 columns and each column
> with
> >>>>>>> 100
> >>>>>>> versions.
> >>>>>>> There's another column faimily with four columns an one image of
> 100kb.
> >>>>>>>   (I've tried without this column family as well.)
> >>>>>>> The table is partitioned manually in all the slaves, so data are
> >>>>>>> balanced
> >>>>>>> in the cluster.
> >>>>>>>
> >>>>>>> I'm executing this sentence *scan 'table1', {FILTER =>
> "ValueFilter(=,
> >>>>>>> 'binary:5')"* in HBase 0.94.6
> >>>>>>> My time for lease and rpc is three minutes.
> >>>>>>> Since, it's a full scan of the table, I have been playing with the
> >>>>>>> BLOCKCACHE as well (just disable and enable, not about the size of
> >>>>>>> it). I
> >>>>>>> thought that it was going to have too much calls to the GC. I'm not
> >>>>>>> sure
> >>>>>>> about this point.
> >>>>>>>
> >>>>>>> I know that it's not the best way to use HBase, it's just a test. I
> >>>>>>> think
> >>>>>>> that it's not working because the hardware isn't enough, although,
> I
> >>>>>>> would
> >>>>>>> like to try some kind of tunning to improve it.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 10/04/14 14:21, Ted Yu wrote:
> >>>>>>>
> >>>>>>> Can you give us a bit more information:
> >>>>>>>> HBase release you're running
> >>>>>>>> What filters are used for the scan
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz  wrote:
> >>>>>>>>
> >>>>>>>>   I got this error when I execute a full scan with filters about a
> >>>>>>>> table.
> >>>>>>>>
> >>>>>>>>> Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.
> >>>>>>>>> regionserver.LeaseException:
> >>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException: lease
> >>>>>>>>> '-4165751462641113359' does not exist
> >>>>>>>>>  at
> org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>  at org.apache.hadoop.hbase.regionserver.HRegionServer.
> >>>>>>>>> next(HRegionServer.java:2482)
> >>>>>>>>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> >>>>>>>>>  at sun.reflect.NativeMethodAccessorImpl.invoke(
> >>>>>>>>> NativeMethodAccessorImpl.java:39)
> >>>>>>>>>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> >>>>>>>>> DelegatingMethodAccessorImpl.java:25)
> >>>>>>>>>  at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>>>>>>  at
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(
> >>>>>>>>> WritableRpcEngine.java:320)
> >>>>>>>>>  at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(
> >>>>>>>>> HBaseServer.java:1428)
> >>>>>>>>>
> >>>>>>>>> I have read about increase the lease time and rpc time, but it's
> not
> >>>>>>>>> working.. what else could I try?? The table isn't too big. I have
> >>>>>>>>> been
> >>>>>>>>> checking the logs from GC, HMaster and some RegionServers and I
> >>>>>>>>> didn't see
> >>>>>>>>> anything weird. I tried as well to try with a couple of caching
> >>>>>>>>> values.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>> --
> >>>>>> *Guillermo Ortiz*
> >>>>>> /Big Data Developer/
> >>>>>>
> >>>>>> Telf.: +34 917 680 490
> >>>>>> Fax: +34 913 833 301
> >>>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
> >>>>>>
> >>>>>> _http://www.bidoop.es_
> >>>>>>
> >>>>>>
> >>>>>>
> >>> --
> >>> *Guillermo Ortiz*
> >>> /Big Data Developer/
> >>>
> >>> Telf.: +34 917 680 490
> >>> Fax: +34 913 833 301
> >>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
> >>>
> >>> _http://www.bidoop.es_
> >>>
> >>>
> >
> >
> > --
> > *Guillermo Ortiz*
> > /Big Data Developer/
> >
> > Telf.: +34 917 680 490
> > Fax: +34 913 833 301
> > C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
> >
> > _http://www.bidoop.es_
> >
>
>


Re: Lease exception when I execute large scan with filters.

2014-04-11 Thread Guillermo Ortiz
Okay, thank you, I'll check it this Monday. I didn't know that a Scan reads
all the versions.
So, the scan was reading every column and every version, although it only showed
me the newest version because I didn't specify anything with the VERSIONS
attribute. It makes sense that it takes so long.
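
A sketch (0.94-era API; the table name, the value and the one-hour window are placeholders) of the more selective scan Ted describes below: limiting the versions returned and, more importantly, the time range, so store files outside the window can be skipped instead of every stored version being read:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class SelectiveScan {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "table1");
        Scan scan = new Scan();
        scan.setMaxVersions(1);                        // only the newest version of each cell
        long now = System.currentTimeMillis();
        scan.setTimeRange(now - 3600L * 1000L, now);   // only cells written in the last hour
        scan.setCacheBlocks(false);
        scan.setCaching(100);
        scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
                                       new BinaryComparator(Bytes.toBytes("5"))));
        ResultScanner scanner = table.getScanner(scan);
        long matches = 0;
        for (Result r : scanner) {
            matches += r.size();
        }
        scanner.close();
        table.close();
        System.out.println("matching cells: " + matches);
    }
}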


2014-04-11 16:57 GMT+02:00 Ted Yu :

> In your previous example:
> scan 'table1', {FILTER => "ValueFilter(=, 'binary:5')"}
>
> there was no expression w.r.t. timestamp. See the following javadoc from
> Scan.java:
>
>  * To only retrieve columns within a specific range of version timestamps,
>
>  * execute {@link #setTimeRange(long, long) setTimeRange}.
>
>  * 
>
>  * To only retrieve columns with a specific timestamp, execute
>
>  * {@link #setTimeStamp(long) setTimestamp}.
>
> You can use one of the above methods to make your scan more selective.
>
>
> ValueFilter#filterKeyValue(Cell) doesn't utilize advanced feature of
> ReturnCode. You can refer to:
>
>
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.ReturnCode.html
>
> You can take a look at SingleColumnValueFilter#filterKeyValue() for example
> of how various ReturnCode's are used to speed up scan.
>
> Cheers
>
>
> On Fri, Apr 11, 2014 at 8:40 AM, Guillermo Ortiz  >wrote:
>
> > I read something interesting about it in HBase TDG.
> >
> > Page 344:
> > The StoreScanner class combines the store files and memstore that the
> > Store instance
> > contains. It is also where the exclusion happens, based on the Bloom
> > filter, or the timestamp. If you are asking for versions that are not
> more
> > than 30 minutes old, for example, you can skip all storage files that are
> > older than one hour: they will not contain anything of interest. See "Key
> > Design" on page 357 for details on the exclusion, and how to make use of
> > it.
> >
> > So, I guess that it doesn't have to read all the HFiles?? But, I don't
> know
> > if HBase really uses the timestamp of each row or the date of the file. I
> > guess when I execute the scan, it reads everything, but, I don't know
> why.
> > I think there's something else that I don't see so that everything works
> to
> > me.
> >
> >
> > 2014-04-11 13:05 GMT+02:00 gortiz :
> >
> > > Sorry, I didn't get it why it should read all the timestamps and not
> just
> > > the newest it they're sorted and you didn't specific any timestamp in
> > your
> > > filter.
> > >
> > >
> > >
> > > On 11/04/14 12:13, Anoop John wrote:
> > >
> > >> In the storage layer (HFiles in HDFS) all versions of a particular
> cell
> > >> will be staying together.  (Yes it has to be lexicographically ordered
> > >> KVs). So during a scan we will have to read all the version data.  At
> > this
> > >> storage layer it doesn't know the versions stuff etc.
> > >>
> > >> -Anoop-
> > >>
> > >> On Fri, Apr 11, 2014 at 3:33 PM, gortiz  wrote:
> > >>
> > >>  Yes, I have tried with two different values for that value of
> versions,
> > >>> 1000 and maximum value for integers.
> > >>>
> > >>> But, I want to keep those versions. I don't want to keep just 3
> > versions.
> > >>> Imagine that I want to record a new version each minute and store a
> > day,
> > >>> those are 1440 versions.
> > >>>
> > >>> Why is HBase going to read all the versions?? , I thought, if you
> don't
> > >>> indicate any versions it's just read the newest and skip the rest. It
> > >>> doesn't make too much sense to read all of them if data is sorted,
> plus
> > >>> the
> > >>> newest version is stored in the top.
> > >>>
> > >>>
> > >>>
> > >>> On 11/04/14 11:54, Anoop John wrote:
> > >>>
> > >>>What is the max version setting u have done for ur table cf?
>  When u
> > >>>> set
> > >>>> some a value, HBase has to keep all those versions.  During a scan
> it
> > >>>> will
> > >>>> read all those versions. In 94 version the default value for the max
> > >>>> versions is 3.  I guess you have set some bigger value.   If u have
> > not,
> > >>>> mind testing after a major compaction?
&g

Re: Lease exception when I execute large scan with filters.

2014-04-11 Thread Guillermo Ortiz
>>>>>> I'm generating again the dataset with a bigger blocksize (previously
>>>>>> was
>>>>>> 64Kb, now, it's going to be 1Mb). I could try tunning the scanning and
>>>>>> baching parameters, but I don't think they're going to affect too
>>>>>> much.
>>>>>>
>>>>>> Another test I want to do, it's generate the same dataset with just
>>>>>> 100versions, It should spend around the same time, right? Or am I
>>>>>> wrong?
>>>>>>
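(For reference, the scanner caching and batching knobs mentioned here are
per-Scan settings in the Java client; the numbers below are only illustrative,
not recommendations from this thread. Larger caching values mean fewer round
trips, but each next() call then has to finish within the scanner lease
period.)

import org.apache.hadoop.hbase.client.Scan;

public class ScanTuningExample {
  // Illustrative values; the right numbers depend on row width and client heap.
  public static Scan tunedFullScan() {
    Scan scan = new Scan();
    scan.setCaching(100);       // rows fetched per RPC to the region server
    scan.setBatch(100);         // cap on cells per Result, useful for very wide rows
    scan.setCacheBlocks(false); // a one-off full scan shouldn't churn the block cache
    return scan;
  }
}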
>>>>>> On 10/04/14 18:08, Ted Yu wrote:
>>>>>>
>>>>>>> It should be the newest version of each value.
>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 10, 2014 at 9:55 AM, gortiz  wrote:
>>>>>>>
>>>>>>>> Another little question: with the filter I'm using, do I check all the
>>>>>>>> versions or just the newest? Because I'm wondering whether, when I do a
>>>>>>>> scan over the whole table, I look for the value "5" in the entire
>>>>>>>> dataset or only in the newest version of each value.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/04/14 16:52, gortiz wrote:
>>>>>>>>
>>>>>>>>> I was trying to check the behaviour of HBase. The cluster is a group
>>>>>>>>> of old computers: one master and five slaves, each one with 2Gb, so
>>>>>>>>> 12Gb in total.
>>>>>>>>> The table has a column family with 1000 columns and each column with
>>>>>>>>> 100 versions.
>>>>>>>>> There's another column family with four columns and one image of
>>>>>>>>> 100kb. (I've tried without this column family as well.)
>>>>>>>>> The table is partitioned manually across all the slaves, so the data
>>>>>>>>> is balanced in the cluster.
>>>>>>>>>
>>>>>>>>> I'm executing this sentence: scan 'table1', {FILTER => "ValueFilter(=,
>>>>>>>>> 'binary:5')"} in HBase 0.94.6.
>>>>>>>>> My lease and rpc timeouts are three minutes.
>>>>>>>>> Since it's a full scan of the table, I have been playing with the
>>>>>>>>> BLOCKCACHE as well (just disabling and enabling it, not changing its
>>>>>>>>> size). I thought that it was going to cause too many calls to the GC.
>>>>>>>>> I'm not sure about this point.
>>>>>>>>>
>>>>>>>>> I know that it's not the best way to use HBase, it's just a test. I
>>>>>>>>> think that it's not working because the hardware isn't enough,
>>>>>>>>> although I would like to try some kind of tuning to improve it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/04/14 14:21, Ted Yu wrote:
>>>>>>>>>
>>>>>>>>> Can you give us a bit more information:
>>>>>>>>>
>>>>>>>>>  HBase release you're running
>>>>>>>>>> What filters are used for the scan
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz  wrote:
>>>>>>>>>>
>>>>>>>>>> I got this error when I execute a full scan with filters over a
>>>>>>>>>> table.
>>>>>>>>>>
>>>>>>>>>>> Caused by: java.lang.RuntimeException:
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException:
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException: lease
>>>>>>>>>>> '-4165751462641113359' does not exist
>>>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
>>>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2482)
>>>>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>>>>>>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
>>>>>>>>>>>
>>>>>>>>>>> I have read about increasing the lease time and rpc time, but it's
>>>>>>>>>>> not working. What else could I try? The table isn't too big. I have
>>>>>>>>>>> been checking the logs from the GC, the HMaster and some
>>>>>>>>>>> RegionServers and I didn't see anything weird. I also tried a couple
>>>>>>>>>>> of caching values.
>>>>>>>>>>>
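(The lease and RPC limits being discussed are normally raised in hbase-site.xml
on the cluster and the servers restarted. Purely as a sketch of the 0.94-era
property names, seen from the client side; the values simply mirror the three
minutes mentioned earlier in the thread.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ScannerTimeoutConfigExample {
  public static Configuration withLongScannerTimeouts() {
    Configuration conf = HBaseConfiguration.create();
    // Scanner lease period; the region servers must use the same (or larger)
    // value in their own hbase-site.xml for it to take effect.
    conf.setLong("hbase.regionserver.lease.period", 180000L); // 3 minutes
    // RPC timeout; should be at least as large as the lease period.
    conf.setLong("hbase.rpc.timeout", 180000L);
    return conf;
  }
}

(A lease usually expires because a single next() call took longer than the
lease period, so making each call cheaper, with a more selective scan or a
smaller caching value, often helps more than raising the limits.)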
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>> *Guillermo Ortiz*
>>>>>>>> /Big Data Developer/
>>>>>>>>
>>>>>>>> Telf.: +34 917 680 490
>>>>>>>> Fax: +34 913 833 301
>>>>>>>>
>>>>>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
>>>>>>>>
>>>>>>>> _http://www.bidoop.es_
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>> *Guillermo Ortiz*
>>>>> /Big Data Developer/
>>>>>
>>>>> Telf.: +34 917 680 490
>>>>> Fax: +34 913 833 301
>>>>>
>>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
>>>>>
>>>>> _http://www.bidoop.es_
>>>>>
>>>>>
>>>>>
>>>>>  --
>>> *Guillermo Ortiz*
>>> /Big Data Developer/
>>>
>>> Telf.: +34 917 680 490
>>> Fax: +34 913 833 301
>>>   C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
>>>
>>> _http://www.bidoop.es_
>>>
>>>
>>>
>
> --
> *Guillermo Ortiz*
> /Big Data Developer/
>
> Telf.: +34 917 680 490
> Fax: +34 913 833 301
> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
>
> _http://www.bidoop.es_
>
>