Re: bulk loading - region creation/pre-splitting

2012-08-29 Thread Adrien Mogenet
If you plan pre-splitting regions, look at the classes exposed by
RegionSplitter 
(http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/RegionSplitter.html).

Are your keys Strings representing hexadecimal values, or are they really
binary encoded? (I mean \xFF\x03 and not "F3", for example.)
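
For example, something along these lines would pre-create ~30 regions for
32-character hex (md5-like) keys by splitting evenly on the first two hex
characters. This is an untested sketch with made-up table/family names;
RegionSplitter also ships split algorithms such as HexStringSplit that can
compute split points for you.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("user_aggregates"); // assumed table name
    desc.addFamily(new HColumnDescriptor("d"));                      // assumed family

    // Keys are lowercase hex strings, so split the "00".."ff" prefix space
    // evenly: 29 split points give 30 regions.
    int numRegions = 30;
    List<byte[]> splits = new ArrayList<byte[]>();
    for (int i = 1; i < numRegions; i++) {
      long boundary = (0x100L * i) / numRegions;
      splits.add(Bytes.toBytes(String.format("%02x", boundary)));
    }

    admin.createTable(desc, splits.toArray(new byte[splits.size()][]));
    admin.close();
  }
}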

On Wed, Aug 29, 2012 at 4:46 PM, Oleg Ruchovets  wrote:
> Hi ,
> I have a bulk loading job.
> My job is for user data aggregation.
> Before I run the bulk loading aggregation I want to create regions.
> UserIDs look like this:
>
> 943e2c6d66d732e06ab257903f240d27
> a0617cb2b964690a39b0d93e7fe2f021
> ac85b4dee6d8c8495d61201234dfb73e
> b8416d5e0fe2a1228f042dffa8d291e2
> c422be9e75d28d9afe0f1f98f59cda92
> fe6b0ad1822455958586e240eb75c1d7
> 1790ee2ce4487d976cd9eddd036275d6
> 344c3de9449a9522d2a4de8bb9e81b02
> 4fcccd6790aec3056f897741b467d08c
> 6b67dc1922e4fc0cd6fa31f64bd51ef3
> 87f1374e7c900a243450f5b5c3a2b2b9
> a4180db6a62f300cdecf77310f0010ac
>
> I have ~50,000,000 users. I run the aggregation on a daily basis, and per
> day I have ~30 regions.
> So the objective is to create 30 regions with a more or less equal
> distribution.
>
> The question is: what is the best practice to determine the start/end keys
> for the regions in my use case?
>
> Thanks in advance
> Oleg.

-- 
AM


Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-08-29 Thread dva



Eran Kutner wrote:
> 
> Hi,
> We're seeing occasional regionserver crashes during heavy write operations
> to Hbase (at the reduce phase of large M/R jobs). I have increased the
> file
> descriptors, HDFS xceivers, HDFS threads to the recommended settings and
> actually way above.
> 
> Here is an example of the HBase log (showing only errors):
> 
> 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-8928911185099340956_5189425java.io.IOException: Bad response 1 for
> block blk_-8928911185099340956_5189425 from datanode 10.1.104.6:50010
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2986)
> 
> 2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient:
> DataStreamer
> Exception: java.io.InterruptedIOException: Interruped while waiting for IO
> on channel java.nio.channels.SocketChannel[connected
> local=/10.1.104.9:59642remote=/
> 10.1.104.9:50010]. 0 millis timeout left.
> at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> at
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> at java.io.DataOutputStream.write(DataOutputStream.java:90)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2848)
> 
> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8928911185099340956_5189425 bad datanode[2]
> 10.1.104.6:50010
> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8928911185099340956_5189425 in pipeline
> 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode
> 10.1.104.6:50010
> 2012-05-10 03:48:30,174 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422,
> load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983):
> regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> java.io.InterruptedIOException: Aborting compaction of store properties in
> region
> gs_users,611|QoCW/euBIKuMat/nRC5Xtw==,1334983658004.878522ea91f41cd76b903ea06ccd17f9.
> because user requested stop.
> at
> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
> at
> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
> at
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> 
> 
> This is from 10.1.104.9 (same machine running the region server that
> crashed):
> 2012-05-10 03:31:16,785 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-8928911185099340956_5189425 src: /10.1.104.9:59642 dest: /
> 10.1.104.9:50010
> 2012-05-10 03:35:39,000 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Connection reset
> 2012-05-10 03:35:39,052 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> for block blk_-8928911185099340956_5189425
> java.nio.channels.ClosedByInterruptException
> 2012-05-10 03:35:39,053 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_-8928911185099340956_5189425 received exception java.io.IOException:
> Interrupted receiveBlock
> 2012-05-10 03:35:39,055 ERROR
> org.apache.hadoop.security.UserGroupInformation:
> PriviledgedActionException
> as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
> blk_-8928911185099340956_5189425 length is 24384000 does not match block
> file length 24449024
> 2012-05-10 03:35:39,055 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 3 on 50020, call
> startBlockRecovery(blk_-8928911185099340956_5189425) from
> 10.1.104.8:50251:
> error: java.io.IOException: Block blk_-8928911185099340956_5189425 length
> is 24384000 does not match block file length 24449024
> java.io.IOExc

Re: Inconsistent scan performance with caching set to 1

2012-08-29 Thread Stack
On Wed, Aug 29, 2012 at 10:42 AM, Wayne  wrote:
> This is basically a read bug/performance problem. The execution path
> followed when the caching is used up is not consistent with the initial
> execution path/performance. Can anyone help shed light on this? Were there
> any changes in 0.94 that introduced this (we have not tested on other
> versions)? Any help or advice would be appreciated. As it stands we are
> looking at having to reverse engineer every aspect of a read, from both the
> hbase client and server components, to find and fix it.
>
> One additional lead is that not all rows behave like this. Only certain
> small rows seem to do this consistently. Most of our rows are larger and do
> not have this behavior.
>

Nagles?  (https://issues.apache.org/jira/browse/HBASE-2125)
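
If you want to rule Nagle's out, you can check/flip the client-side flag
along these lines (sketch only; the property name below is from the 0.9x
client code, so please verify it for your version, and set
ipc.server.tcpnodelay on the servers via hbase-site.xml rather than in code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class NagleCheck {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // print the current client-side TCP_NODELAY setting
    System.out.println("hbase.ipc.client.tcpnodelay = "
        + conf.getBoolean("hbase.ipc.client.tcpnodelay", false));
    // for an experiment, enable it before creating HTable instances
    conf.setBoolean("hbase.ipc.client.tcpnodelay", true);
  }
}
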
St.Ack


Re: md5 hash key and splits

2012-08-29 Thread Stack
On Wed, Aug 29, 2012 at 9:38 PM, Mohit Anchlia  wrote:
> On Wed, Aug 29, 2012 at 9:19 PM, Stack  wrote:
>
>>  On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia 
>> wrote:
>> > If I use an md5 hash + timestamp rowkey, would hbase automatically detect the
>> > difference in ranges and perform splits? How does splitting work in such cases,
>> > or is it still advisable to manually split the regions?
>>
>
> What logic would you recommend to split the table into multiple regions
> when using md5 hash?
>

It's hard to know ahead of time how well your inserts will spread over
the md5 namespace.  You could try sampling, or just let HBase take care
of the splits for you (is there a problem with letting HBase do the
splits?)
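
If you go the sampling route, the split points could be derived along these
lines (untested sketch; where the sample of existing keys comes from is up
to you):

import java.util.Collections;
import java.util.List;

import org.apache.hadoop.hbase.util.Bytes;

public class SampledSplits {
  // pick evenly spaced percentiles of a sorted key sample as split points
  static byte[][] splitPointsFromSample(List<byte[]> sampleKeys, int numRegions) {
    Collections.sort(sampleKeys, Bytes.BYTES_COMPARATOR);
    byte[][] splits = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      splits[i - 1] = sampleKeys.get(i * sampleKeys.size() / numRegions);
    }
    return splits;
  }
}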

St.Ack


Re: md5 hash key and splits

2012-08-29 Thread Mohit Anchlia
On Wed, Aug 29, 2012 at 9:19 PM, Stack  wrote:

>  On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia 
> wrote:
> > If I use an md5 hash + timestamp rowkey, would hbase automatically detect the
> > difference in ranges and perform splits? How does splitting work in such cases,
> > or is it still advisable to manually split the regions?
>

What logic would you recommend to split the table into multiple regions
when using md5 hash?


> Yes.
>
> On how split works, when a region hits the maximum configured size, it
> splits in two.
>
> Manual splitting can be useful when you know your distribution and
> you'd save on hbase doing it for you.  It can speed up bulk loads for
> instance.
>
> St.Ack
>


Re: md5 hash key and splits

2012-08-29 Thread Stack
On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia  wrote:
> If I use an md5 hash + timestamp rowkey, would hbase automatically detect the
> difference in ranges and perform splits? How does splitting work in such cases,
> or is it still advisable to manually split the regions?

Yes.

On how split works, when a region hits the maximum configured size, it
splits in two.

Manual splitting can be useful when you know your distribution and
you'd save HBase from having to do it for you.  It can speed up bulk
loads, for instance.

St.Ack


Re: setTimeRange and setMaxVersions seem to be inefficient

2012-08-29 Thread Jerry Lam
Hi Ted:

Sure, will do.
I also implement the reset method to set previousIncludedQualifier to null
for the next row to come.
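
For reference, a rough sketch of the two methods (the reset mentioned above,
plus the getNextKeyHint Ted suggested). The hint here simply seeks to the
tracked qualifier within the current row, so it would need adapting to
whatever the SEEK_NEXT_USING_HINT case is actually meant to skip:

@Override
public void reset() {
  // called before each new row
  previousIncludedQualifier = null;
}

@Override
public KeyValue getNextKeyHint(KeyValue currentKV) {
  // jump straight to the column of interest within the current row
  return KeyValue.createFirstOnRow(currentKV.getRow(),
      currentKV.getFamily(), this.qualifier);
}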

Best Regards,

Jerry

On Wed, Aug 29, 2012 at 1:47 PM, Ted Yu  wrote:

> Jerry:
> Remember to also implement:
>
> +  @Override
> +  public KeyValue getNextKeyHint(KeyValue currentKV) {
>
> You can log a JIRA for supporting ReturnCode.INCLUDE_AND_NEXT_COL.
>
> Cheers
>
> On Wed, Aug 29, 2012 at 6:59 AM, Jerry Lam  wrote:
>
> > Hi Lars:
> >
> > Thanks for spending time discussing this with me. I appreciate it.
> >
> > I tried to implement the setMaxVersions(1) inside the filter as follows:
> >
> > @Override
> > public ReturnCode filterKeyValue(KeyValue kv) {
> >
> > // check if the same qualifier as the one that has been included
> > previously. If yes, jump to next column
> > if (previousIncludedQualifier != null &&
> > Bytes.compareTo(previousIncludedQualifier,kv.getQualifier()) == 0) {
> > previousIncludedQualifier = null;
> > return ReturnCode.NEXT_COL;
> > }
> > // another condition that makes the jump further using HINT
> > if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
> > LOG.info("Matched Found.");
> > return ReturnCode.SEEK_NEXT_USING_HINT;
> >
> > }
> > // include this to the result and keep track of the included
> > qualifier so the next version of the same qualifier will be excluded
> > previousIncludedQualifier = kv.getQualifier();
> > return ReturnCode.INCLUDE;
> > }
> >
> > Does this look reasonable or there is a better way to achieve this? It
> > would be nice to have ReturnCode.INCLUDE_AND_NEXT_COL for this case
> though.
> >
> > Best Regards,
> >
> > Jerry
> >
> >
> > On Wed, Aug 29, 2012 at 2:09 AM, lars hofhansl 
> > wrote:
> >
> > > Hi Jerry,
> > >
> > > my answer will be the same again:
> > > Some folks will want the max versions set by the client to be before
> > > filters and some folks will want it to restrict the end result.
> > > It's not possible to have it both ways. Your filter needs to do the
> right
> > > thing.
> > >
> > >
> > > There's a lot of discussion around this in HBASE-5104.
> > >
> > >
> > > -- Lars
> > >
> > >
> > >
> > > 
> > >  From: Jerry Lam 
> > > To: user@hbase.apache.org; lars hofhansl 
> > > Sent: Tuesday, August 28, 2012 1:52 PM
> > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi Lars:
> > >
> > > I see. Please refer to the inline comment below.
> > >
> > > Best Regards,
> > >
> > > Jerry
> > >
> > > On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl 
> > > wrote:
> > >
> > > > What I was saying was: It depends. :)
> > > >
> > > > First off, how do you get to 1000 versions? In 0.94++ older version
> are
> > > > pruned upon flush, so you need 333 flushes (assuming 3 versions on
> the
> > > CF)
> > > > to get 1000 versions.
> > > >
> > >
> > > I forgot that the default number of version to keep is 3. If this is
> what
> > > people use most of the time, yes you are right for this type of
> scenarios
> > > where the number of version per column to keep is small.
> > >
> > > By that time some compactions will have happened and you're back to
> close
> > > > to 3 versions (maybe 9, 12, or 15 or so, depending on how store files
> > you
> > > > have).
> > > >
> > > > Now, if you have that many version because because you set
> > VERSIONS=>1000
> > > > in your CF... Then imagine you have 100 columns with 1000 versions
> > each.
> > > >
> > >
> > > Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
> > > versioning myself)
> > >
> > > In your scenario below you'd do 10 comparisons if the filter would
> be
> > > > evaluated after the version counting. But only 1100 with the current
> > > code.
> > > > (or at least in that ball park)
> > > >
> > >
> > > This is where I don't quite understand what you mean.
> > >
> > > if the framework counts the number of ReturnCode.INCLUDE and then stops
> > > feeding the KeyValue into the filterKeyValue method after it reaches
> the
> > > count specified in setMaxVersions (i.e. 1 for the case we discussed),
> > > should then be just 100 comparisons only (at most) instead of 1100
> > > comparisons? Maybe I don't understand how the current way is doing...
> > >
> > >
> > >
> > > >
> > > > The gist is: One can construct scenarios where one approach is better
> > > than
> > > > the other. Only one order is possible.
> > > > If you write a custom filter and you care about these things you
> should
> > > > use the seek hints.
> > > >
> > > > -- Lars
> > > >
> > > >
> > > > - Original Message -
> > > > From: Jerry Lam 
> > > > To: user@hbase.apache.org; lars hofhansl 
> > > > Cc:
> > > > Sent: Tuesday, August 28, 2012 7:17 AM
> > > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > > >
> > > > Hi Lars:
> > > >
> > > > Thanks for the reply.
> > > > I need to understand if I misunderstood the perceived inefficiency
> > > because
> > > > it seems you don't think quite th

Re: setTimeRange and setMaxVersions seem to be inefficient

2012-08-29 Thread Ted Yu
Jerry:
Remember to also implement:

+  @Override
+  public KeyValue getNextKeyHint(KeyValue currentKV) {

You can log a JIRA for supporting ReturnCode.INCLUDE_AND_NEXT_COL.

Cheers

On Wed, Aug 29, 2012 at 6:59 AM, Jerry Lam  wrote:

> Hi Lars:
>
> Thanks for spending time discussing this with me. I appreciate it.
>
> I tried to implement the setMaxVersions(1) inside the filter as follows:
>
> @Override
> public ReturnCode filterKeyValue(KeyValue kv) {
>
> // check if the same qualifier as the one that has been included
> previously. If yes, jump to next column
> if (previousIncludedQualifier != null &&
> Bytes.compareTo(previousIncludedQualifier,kv.getQualifier()) == 0) {
> previousIncludedQualifier = null;
> return ReturnCode.NEXT_COL;
> }
> // another condition that makes the jump further using HINT
> if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
> LOG.info("Matched Found.");
> return ReturnCode.SEEK_NEXT_USING_HINT;
>
> }
> // include this to the result and keep track of the included
> qualifier so the next version of the same qualifier will be excluded
> previousIncludedQualifier = kv.getQualifier();
> return ReturnCode.INCLUDE;
> }
>
> Does this look reasonable or there is a better way to achieve this? It
> would be nice to have ReturnCode.INCLUDE_AND_NEXT_COL for this case though.
>
> Best Regards,
>
> Jerry
>
>
> On Wed, Aug 29, 2012 at 2:09 AM, lars hofhansl 
> wrote:
>
> > Hi Jerry,
> >
> > my answer will be the same again:
> > Some folks will want the max versions set by the client to be before
> > filters and some folks will want it to restrict the end result.
> > It's not possible to have it both ways. Your filter needs to do the right
> > thing.
> >
> >
> > There's a lot of discussion around this in HBASE-5104.
> >
> >
> > -- Lars
> >
> >
> >
> > 
> >  From: Jerry Lam 
> > To: user@hbase.apache.org; lars hofhansl 
> > Sent: Tuesday, August 28, 2012 1:52 PM
> > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi Lars:
> >
> > I see. Please refer to the inline comment below.
> >
> > Best Regards,
> >
> > Jerry
> >
> > On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl 
> > wrote:
> >
> > > What I was saying was: It depends. :)
> > >
> > > First off, how do you get to 1000 versions? In 0.94++ older version are
> > > pruned upon flush, so you need 333 flushes (assuming 3 versions on the
> > CF)
> > > to get 1000 versions.
> > >
> >
> > I forgot that the default number of version to keep is 3. If this is what
> > people use most of the time, yes you are right for this type of scenarios
> > where the number of version per column to keep is small.
> >
> > By that time some compactions will have happened and you're back to close
> > > to 3 versions (maybe 9, 12, or 15 or so, depending on how store files
> you
> > > have).
> > >
> > > Now, if you have that many version because because you set
> VERSIONS=>1000
> > > in your CF... Then imagine you have 100 columns with 1000 versions
> each.
> > >
> >
> > Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
> > versioning myself)
> >
> > In your scenario below you'd do 10 comparisons if the filter would be
> > > evaluated after the version counting. But only 1100 with the current
> > code.
> > > (or at least in that ball park)
> > >
> >
> > This is where I don't quite understand what you mean.
> >
> > if the framework counts the number of ReturnCode.INCLUDE and then stops
> > feeding the KeyValue into the filterKeyValue method after it reaches the
> > count specified in setMaxVersions (i.e. 1 for the case we discussed),
> > should then be just 100 comparisons only (at most) instead of 1100
> > comparisons? Maybe I don't understand how the current way is doing...
> >
> >
> >
> > >
> > > The gist is: One can construct scenarios where one approach is better
> > than
> > > the other. Only one order is possible.
> > > If you write a custom filter and you care about these things you should
> > > use the seek hints.
> > >
> > > -- Lars
> > >
> > >
> > > - Original Message -
> > > From: Jerry Lam 
> > > To: user@hbase.apache.org; lars hofhansl 
> > > Cc:
> > > Sent: Tuesday, August 28, 2012 7:17 AM
> > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi Lars:
> > >
> > > Thanks for the reply.
> > > I need to understand if I misunderstood the perceived inefficiency
> > because
> > > it seems you don't think quite the same.
> > >
> > > Let say, as an example, we have 1 row with 2 columns (col-1 and col-2)
> > in a
> > > table and each column has 1000 versions. Using the following code (the
> > code
> > > might have errors and don't compile):
> > > /**
> > > * This is very simple use case of a ColumnPrefixFilter.
> > > * In fact all other filters that make use of filterKeyValue will see
> > > similar
> > > * performance problems that I have concerned with when the number of
> > > * versions per column could be huge.
>

Re: Inconsistent scan performance with caching set to 1

2012-08-29 Thread Wayne
This is basically a read bug/performance problem. The execution path
followed when the caching is used up is not consistent with the initial
execution path/performance. Can anyone help shed light on this? Were there
any changes in 0.94 that introduced this (we have not tested on other
versions)? Any help or advice would be appreciated. As it stands we are
looking at having to reverse engineer every aspect of a read, from both the
hbase client and server components, to find and fix it.

One additional lead is that not all rows behave like this. Only certain
small rows seem to do this consistently. Most of our rows are larger and do
not have this behavior.
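
For reference, the measurement pattern is essentially the following
(untested sketch; table name and start row are placeholders), timing each
next() individually with caching left at 1:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTiming {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // placeholder table name
    Scan scan = new Scan(Bytes.toBytes("Row1"));  // placeholder start row
    scan.setCaching(1);                           // one row per RPC, as in our tests
    ResultScanner scanner = table.getScanner(scan);
    try {
      while (true) {
        long t0 = System.currentTimeMillis();
        Result r = scanner.next();                // each call is a separate RPC here
        long elapsed = System.currentTimeMillis() - t0;
        if (r == null) break;
        System.out.println(Bytes.toString(r.getRow()) + ": " + elapsed + " ms");
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}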

Thanks.

On Tue, Aug 28, 2012 at 4:35 PM, Jay T  wrote:

> We are running Hbase 0.94 with Hadoop 1.0.3. We use Python thrift to talk
> to Hbase.
>
>
> We are experiencing strange behavior when scanning specific rows from Hbase
> (Caching is always set to 1 in the scanner). Each of these rows are
> retrieved in (~12 ms) if they are the startRow of the scanner. However if
> they are somewhere in between they take (~ 42 ms) to read.
>
>
> Assuming Row1, Row2, Row3, Row4, ..., Row10 are the row keys.
>
>
> Scenario 1: (Scanner starts from Row1)
>
> 
>
>
> Row 1: 12 ms
>
> Row 2: 42 ms
>
> Row 3: 42 ms
>
> …
>
> …
>
> …
>
>
> Scenario 2: (Scanner starts from Row2)
>
> =
>
> Row 2: 12 ms
>
> Row 3: 42 ms
>
> Row 4: 42 ms
>
>
>
> Scenario 3: (Scanner starts from Row 3)
>
> ===
>
>
> Row 3: 12 ms
>
> Row 4: 42 ms
>
> Row 5: 42 ms
>
>
>
> You can see that Row 1 and Row 2 and Row 3 each take ~12 ms when they are
> the startRow of the scanner. However their performance degrades if they are
> part of the 'next" call to scanner (caching = 1).
>
> This behavior is seen with both Python thrift and with Java API as well.
>
> When the scan caching is increased to (say 10) then all the rows can be
> retrieved in 20 ms. I understand that by setting a higher caching size the
> number of RPC calls are reduced. However there seems to be something else
> at play.
>
> I added log statements to ServerCallable.java and HRegionServer.java and
> many other files to figure out where the time is lost.
>
>
> *2012-08-24 18:28:43,147 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: CurrentScanResultSize =
> 34976*
>
> *2012-08-24 06:28:43 188 INFO client.ScannerCallable: After client calls
> server*
>
> From the above two log statements I cannot figure out where is 41 ms being
> spent (188 - 147).
>
> Can someone explain to me what is going on after HRegionServer.next
> completes executing and the control goes back to ScannerCallable.java
> statement:
>
> rrs = server.next(scannerId, caching);
>
>
> I would greatly appreciate if someone would help me understand where the
> time might be getting spent.
>
> Thanks,
>
> Jay
>


Re: fast way to do random getRowOrAfter reads

2012-08-29 Thread jmozah
Gets are internally treated very similarly to Scans in the RS, so this is expected.

./Zahoor
HBase Musings


On 29-Aug-2012, at 9:15 PM, Ferdy Galema  wrote:

> Ran some tests and it seems that single-use Scanner requests are not that
> bad after all. I guess the important part is to set row caching to 1 and
> correctly close every scanner afterwards.
> 
> On Mon, Aug 27, 2012 at 4:33 PM, Ferdy Galema wrote:
> 
>> I want to do a lot of random reads, but I need to get the first row after
>> the requested key. I know I can make a scanner every time (with a specified
>> startrow) and close it after a single result is fetched, but this seems
>> like a lot overhead.
>> 
>> Something like HTable's getRowOrBefore method, but then getRowOrAfter.
>> (Note that getRowOrBefore is deprecated).
>> 
>> Any advice?
>> 



Re: HBase Is So Slow To Save Data?

2012-08-29 Thread Bing Li
Dear Cristofer,

Thanks so much for the reminder!

Best regards,
Bing

On Thu, Aug 30, 2012 at 12:32 AM, Cristofer Weber <
cristofer.we...@neogrid.com> wrote:

> There's also a lot of conversions from same values to byte array
> representation, eg, your NeighborStructure constants. You should do this
> conversion only once to save time, since you are doing this inside 3 nested
> loops. Not sure about how much this can improve, but you should try this
> also.
>
> Best regards,
> Cristofer
>
> -Original Message-
> From: Bing Li [mailto:lbl...@gmail.com]
> Sent: Wednesday, August 29, 2012 13:07
> To: user@hbase.apache.org
> Cc: hbase-u...@hadoop.apache.org
> Subject: Re: HBase Is So Slow To Save Data?
>
> I see. Thanks so much!
>
> Bing
>
>
> On Wed, Aug 29, 2012 at 11:59 PM, N Keywal  wrote:
>
> > It's not useful here: if you have a memory issue, it's when your using
> > the list, not when you have finished with it and set it to null.
> > You need to monitor the memory consumption of the jvm, both the client
> > & the server.
> > Google around these keywords, there are many examples on the web.
> > Google as well arrayList initialization.
> >
> > Note as well that the important is not the memory size of the
> > structure on disk but the size of the" List puts = new
> > ArrayList();" before the table put.
> >
> > On Wed, Aug 29, 2012 at 5:42 PM, Bing Li  wrote:
> >
> > > Dear N Keywal,
> > >
> > > Thanks so much for your reply!
> > >
> > > The total amount of data is about 110M. The available memory is
> > > enough,
> > 2G.
> > >
> > > In Java, I just set a collection to NULL to collect garbage. Do you
> > > think it is fine?
> > >
> > > Best regards,
> > > Bing
> > >
> > >
> > > On Wed, Aug 29, 2012 at 11:22 PM, N Keywal  wrote:
> > >
> > >> Hi Bing,
> > >>
> > >> You should expect HBase to be slower in the generic case:
> > >> 1) it writes much more data (see hbase data model), with extra
> > >> columns qualifiers, timestamps & so on.
> > >> 2) the data is written multiple times: once in the write-ahead-log,
> > >> once per replica on datanode & so on again.
> > >> 3) there are inter process calls & inter machine calls on the
> > >> critical path.
> > >>
> > >> This is the cost of the atomicity, reliability and scalability
> features.
> > >> With these features in mind, HBase is reasonably fast to save data
> > >> on a cluster.
> > >>
> > >> On your specific case (without the points 2 & 3 above), the
> > >> performance seems to be very bad.
> > >>
> > >> You should first look at:
> > >> - how much is spent in the put vs. preparing the list
> > >> - do you have garbage collection going on? even swap?
> > >> - what's the size of your final Array vs. the available memory?
> > >>
> > >> Cheers,
> > >>
> > >> N.
> > >>
> > >>
> > >>
> > >> On Wed, Aug 29, 2012 at 4:08 PM, Bing Li  wrote:
> > >>
> > >>> Dear all,
> > >>>
> > >>> By the way, my HBase is in the pseudo-distributed mode. Thanks!
> > >>>
> > >>> Best regards,
> > >>> Bing
> > >>>
> > >>> On Wed, Aug 29, 2012 at 10:04 PM, Bing Li  wrote:
> > >>>
> > >>> > Dear all,
> > >>> >
> > >>> > According to my experiences, it is very slow for HBase to save
> data?
> > >>> Am I
> > >>> > right?
> > >>> >
> > >>> > For example, today I need to save data in a HashMap to HBase. It
> > >>> > took about more than three hours. However when saving the same
> > >>> > HashMap in
> > a
> > >>> file
> > >>> > in the text format with the redirected System.out, it took only
> > >>> > 4.5
> > >>> seconds!
> > >>> >
> > >>> > Why is HBase so slow? It is indexing?
> > >>> >
> > >>> > My code to save data in HBase is as follows. I think the code
> > >>> > must be correct.
> > >>> >
> > >>> > ..
> > >>> > public synchronized void
> > >>> > AddVirtualOutgoingHHNeighbors(ConcurrentHashMap > >>> > ConcurrentHashMap>> hhOutNeighborMap, int
> > >>> timingScale)
> > >>> > {
> > >>> > List puts = new ArrayList();
> > >>> >
> > >>> > String hhNeighborRowKey;
> > >>> > Put hubKeyPut;
> > >>> > Put groupKeyPut;
> > >>> > Put topGroupKeyPut;
> > >>> > Put timingScalePut;
> > >>> > Put nodeKeyPut;
> > >>> > Put hubNeighborTypePut;
> > >>> >
> > >>> > for (Map.Entry > >>> > Set>> sourceHubGroupNeighborEntry :
> > >>> hhOutNeighborMap.entrySet())
> > >>> > {
> > >>> > for (Map.Entry>
> > >>> > groupNeighborEntry :
> > sourceHubGroupNeighborEntry.getValue().entrySet())
> > >>> > {
> > >>> > for (String neighborKey :
> > >>> > groupNeighborEntry.getValue())
> > >>> > {
> > >>> > hhNeighborRowKey =
> > >>> > NeighborStructure.HUB_HUB_NEIGHBOR_ROW +
> > >>> > Tools.GetAHash(sourceHubGroupNeighborEntry.getKey() +
> > >>> > groupNeighborEntry

RES: HBase Is So Slow To Save Data?

2012-08-29 Thread Cristofer Weber
There are also a lot of conversions of the same values to their byte array
representation, e.g., your NeighborStructure constants. You should do this
conversion only once to save time, since you are doing it inside 3 nested
loops. Not sure how much this can improve things, but you should try this as well.
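
Just to illustrate the idea (sketch with placeholder values, not your actual
constants):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PrecomputedBytes {
  // convert the constant strings to byte[] once, not on every loop iteration
  private static final byte[] FAMILY  = Bytes.toBytes("hub_hub_neighbor");  // placeholder
  private static final byte[] HUB_KEY = Bytes.toBytes("hub_key");           // placeholder

  static Put buildPut(String rowKey, String hubKey) {
    Put put = new Put(Bytes.toBytes(rowKey));
    put.add(FAMILY, HUB_KEY, Bytes.toBytes(hubKey)); // reuse the precomputed arrays
    return put;
  }
}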

Best regards,
Cristofer

-Original Message-
From: Bing Li [mailto:lbl...@gmail.com]
Sent: Wednesday, August 29, 2012 13:07
To: user@hbase.apache.org
Cc: hbase-u...@hadoop.apache.org
Subject: Re: HBase Is So Slow To Save Data?

I see. Thanks so much!

Bing


On Wed, Aug 29, 2012 at 11:59 PM, N Keywal  wrote:

> It's not useful here: if you have a memory issue, it's when your using 
> the list, not when you have finished with it and set it to null.
> You need to monitor the memory consumption of the jvm, both the client 
> & the server.
> Google around these keywords, there are many examples on the web.
> Google as well arrayList initialization.
>
> Note as well that the important is not the memory size of the 
> structure on disk but the size of the" List puts = new 
> ArrayList();" before the table put.
>
> On Wed, Aug 29, 2012 at 5:42 PM, Bing Li  wrote:
>
> > Dear N Keywal,
> >
> > Thanks so much for your reply!
> >
> > The total amount of data is about 110M. The available memory is 
> > enough,
> 2G.
> >
> > In Java, I just set a collection to NULL to collect garbage. Do you 
> > think it is fine?
> >
> > Best regards,
> > Bing
> >
> >
> > On Wed, Aug 29, 2012 at 11:22 PM, N Keywal  wrote:
> >
> >> Hi Bing,
> >>
> >> You should expect HBase to be slower in the generic case:
> >> 1) it writes much more data (see hbase data model), with extra 
> >> columns qualifiers, timestamps & so on.
> >> 2) the data is written multiple times: once in the write-ahead-log, 
> >> once per replica on datanode & so on again.
> >> 3) there are inter process calls & inter machine calls on the 
> >> critical path.
> >>
> >> This is the cost of the atomicity, reliability and scalability features.
> >> With these features in mind, HBase is reasonably fast to save data 
> >> on a cluster.
> >>
> >> On your specific case (without the points 2 & 3 above), the 
> >> performance seems to be very bad.
> >>
> >> You should first look at:
> >> - how much is spent in the put vs. preparing the list
> >> - do you have garbage collection going on? even swap?
> >> - what's the size of your final Array vs. the available memory?
> >>
> >> Cheers,
> >>
> >> N.
> >>
> >>
> >>
> >> On Wed, Aug 29, 2012 at 4:08 PM, Bing Li  wrote:
> >>
> >>> Dear all,
> >>>
> >>> By the way, my HBase is in the pseudo-distributed mode. Thanks!
> >>>
> >>> Best regards,
> >>> Bing
> >>>
> >>> On Wed, Aug 29, 2012 at 10:04 PM, Bing Li  wrote:
> >>>
> >>> > Dear all,
> >>> >
> >>> > According to my experiences, it is very slow for HBase to save data?
> >>> Am I
> >>> > right?
> >>> >
> >>> > For example, today I need to save data in a HashMap to HBase. It 
> >>> > took about more than three hours. However when saving the same 
> >>> > HashMap in
> a
> >>> file
> >>> > in the text format with the redirected System.out, it took only 
> >>> > 4.5
> >>> seconds!
> >>> >
> >>> > Why is HBase so slow? It is indexing?
> >>> >
> >>> > My code to save data in HBase is as follows. I think the code 
> >>> > must be correct.
> >>> >
> >>> > ..
> >>> > public synchronized void 
> >>> > AddVirtualOutgoingHHNeighbors(ConcurrentHashMap >>> > ConcurrentHashMap>> hhOutNeighborMap, int
> >>> timingScale)
> >>> > {
> >>> > List puts = new ArrayList();
> >>> >
> >>> > String hhNeighborRowKey;
> >>> > Put hubKeyPut;
> >>> > Put groupKeyPut;
> >>> > Put topGroupKeyPut;
> >>> > Put timingScalePut;
> >>> > Put nodeKeyPut;
> >>> > Put hubNeighborTypePut;
> >>> >
> >>> > for (Map.Entry >>> > Set>> sourceHubGroupNeighborEntry :
> >>> hhOutNeighborMap.entrySet())
> >>> > {
> >>> > for (Map.Entry> 
> >>> > groupNeighborEntry :
> sourceHubGroupNeighborEntry.getValue().entrySet())
> >>> > {
> >>> > for (String neighborKey :
> >>> > groupNeighborEntry.getValue())
> >>> > {
> >>> > hhNeighborRowKey = 
> >>> > NeighborStructure.HUB_HUB_NEIGHBOR_ROW +
> >>> > Tools.GetAHash(sourceHubGroupNeighborEntry.getKey() +
> >>> > groupNeighborEntry.getKey() + timingScale + neighborKey);
> >>> >
> >>> > hubKeyPut = new 
> >>> > Put(Bytes.toBytes(hhNeighborRowKey));
> >>> >
> >>> >
> hubKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY)
> ,
> >>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_HUB_KEY_COLUMN)
> >>> > , Bytes.toBytes(sourceHubGroupNeighborEntry.getKey()));
> >>> >   

Re: HBase Is So Slow To Save Data?

2012-08-29 Thread Bing Li
I see. Thanks so much!

Bing


On Wed, Aug 29, 2012 at 11:59 PM, N Keywal  wrote:

> It's not useful here: if you have a memory issue, it's when your using the
> list, not when you have finished with it and set it to null.
> You need to monitor the memory consumption of the jvm, both the client &
> the server.
> Google around these keywords, there are many examples on the web.
> Google as well arrayList initialization.
>
> Note as well that the important is not the memory size of the structure on
> disk but the size of the" List puts = new ArrayList();" before
> the table put.
>
> On Wed, Aug 29, 2012 at 5:42 PM, Bing Li  wrote:
>
> > Dear N Keywal,
> >
> > Thanks so much for your reply!
> >
> > The total amount of data is about 110M. The available memory is enough,
> 2G.
> >
> > In Java, I just set a collection to NULL to collect garbage. Do you think
> > it is fine?
> >
> > Best regards,
> > Bing
> >
> >
> > On Wed, Aug 29, 2012 at 11:22 PM, N Keywal  wrote:
> >
> >> Hi Bing,
> >>
> >> You should expect HBase to be slower in the generic case:
> >> 1) it writes much more data (see hbase data model), with extra columns
> >> qualifiers, timestamps & so on.
> >> 2) the data is written multiple times: once in the write-ahead-log, once
> >> per replica on datanode & so on again.
> >> 3) there are inter process calls & inter machine calls on the critical
> >> path.
> >>
> >> This is the cost of the atomicity, reliability and scalability features.
> >> With these features in mind, HBase is reasonably fast to save data on a
> >> cluster.
> >>
> >> On your specific case (without the points 2 & 3 above), the performance
> >> seems to be very bad.
> >>
> >> You should first look at:
> >> - how much is spent in the put vs. preparing the list
> >> - do you have garbage collection going on? even swap?
> >> - what's the size of your final Array vs. the available memory?
> >>
> >> Cheers,
> >>
> >> N.
> >>
> >>
> >>
> >> On Wed, Aug 29, 2012 at 4:08 PM, Bing Li  wrote:
> >>
> >>> Dear all,
> >>>
> >>> By the way, my HBase is in the pseudo-distributed mode. Thanks!
> >>>
> >>> Best regards,
> >>> Bing
> >>>
> >>> On Wed, Aug 29, 2012 at 10:04 PM, Bing Li  wrote:
> >>>
> >>> > Dear all,
> >>> >
> >>> > According to my experiences, it is very slow for HBase to save data?
> >>> Am I
> >>> > right?
> >>> >
> >>> > For example, today I need to save data in a HashMap to HBase. It took
> >>> > about more than three hours. However when saving the same HashMap in
> a
> >>> file
> >>> > in the text format with the redirected System.out, it took only 4.5
> >>> seconds!
> >>> >
> >>> > Why is HBase so slow? It is indexing?
> >>> >
> >>> > My code to save data in HBase is as follows. I think the code must be
> >>> > correct.
> >>> >
> >>> > ..
> >>> > public synchronized void
> >>> > AddVirtualOutgoingHHNeighbors(ConcurrentHashMap >>> > ConcurrentHashMap>> hhOutNeighborMap, int
> >>> timingScale)
> >>> > {
> >>> > List puts = new ArrayList();
> >>> >
> >>> > String hhNeighborRowKey;
> >>> > Put hubKeyPut;
> >>> > Put groupKeyPut;
> >>> > Put topGroupKeyPut;
> >>> > Put timingScalePut;
> >>> > Put nodeKeyPut;
> >>> > Put hubNeighborTypePut;
> >>> >
> >>> > for (Map.Entry >>> > Set>> sourceHubGroupNeighborEntry :
> >>> hhOutNeighborMap.entrySet())
> >>> > {
> >>> > for (Map.Entry>
> >>> > groupNeighborEntry :
> sourceHubGroupNeighborEntry.getValue().entrySet())
> >>> > {
> >>> > for (String neighborKey :
> >>> > groupNeighborEntry.getValue())
> >>> > {
> >>> > hhNeighborRowKey =
> >>> > NeighborStructure.HUB_HUB_NEIGHBOR_ROW +
> >>> > Tools.GetAHash(sourceHubGroupNeighborEntry.getKey() +
> >>> > groupNeighborEntry.getKey() + timingScale + neighborKey);
> >>> >
> >>> > hubKeyPut = new
> >>> > Put(Bytes.toBytes(hhNeighborRowKey));
> >>> >
> >>> >
> hubKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> >>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_HUB_KEY_COLUMN),
> >>> > Bytes.toBytes(sourceHubGroupNeighborEntry.getKey()));
> >>> > puts.add(hubKeyPut);
> >>> >
> >>> > groupKeyPut = new
> >>> > Put(Bytes.toBytes(hhNeighborRowKey));
> >>> >
> >>> >
> >>>
> groupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> >>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_GROUP_KEY_COLUMN),
> >>> > Bytes.toBytes(groupNeighborEntry.getKey()));
> >>> > puts.add(groupKeyPut);
> >>> >
> >>> > topGroupKeyPut = new
> >>> > Put(Bytes.toBytes(hhNeighborRowKey));
> >>> >
> >>> >

Re: HBase Is So Slow To Save Data?

2012-08-29 Thread N Keywal
It's not useful here: if you have a memory issue, it's when you're using the
list, not when you have finished with it and set it to null.
You need to monitor the memory consumption of the jvm, both the client &
the server.
Google around these keywords, there are many examples on the web.
Google arrayList initialization as well.

Note as well that what matters is not the size of the structure on
disk but the size of the "List<Put> puts = new ArrayList<Put>();" before
the table put.

On Wed, Aug 29, 2012 at 5:42 PM, Bing Li  wrote:

> Dear N Keywal,
>
> Thanks so much for your reply!
>
> The total amount of data is about 110M. The available memory is enough, 2G.
>
> In Java, I just set a collection to NULL to collect garbage. Do you think
> it is fine?
>
> Best regards,
> Bing
>
>
> On Wed, Aug 29, 2012 at 11:22 PM, N Keywal  wrote:
>
>> Hi Bing,
>>
>> You should expect HBase to be slower in the generic case:
>> 1) it writes much more data (see hbase data model), with extra columns
>> qualifiers, timestamps & so on.
>> 2) the data is written multiple times: once in the write-ahead-log, once
>> per replica on datanode & so on again.
>> 3) there are inter process calls & inter machine calls on the critical
>> path.
>>
>> This is the cost of the atomicity, reliability and scalability features.
>> With these features in mind, HBase is reasonably fast to save data on a
>> cluster.
>>
>> On your specific case (without the points 2 & 3 above), the performance
>> seems to be very bad.
>>
>> You should first look at:
>> - how much is spent in the put vs. preparing the list
>> - do you have garbage collection going on? even swap?
>> - what's the size of your final Array vs. the available memory?
>>
>> Cheers,
>>
>> N.
>>
>>
>>
>> On Wed, Aug 29, 2012 at 4:08 PM, Bing Li  wrote:
>>
>>> Dear all,
>>>
>>> By the way, my HBase is in the pseudo-distributed mode. Thanks!
>>>
>>> Best regards,
>>> Bing
>>>
>>> On Wed, Aug 29, 2012 at 10:04 PM, Bing Li  wrote:
>>>
>>> > Dear all,
>>> >
>>> > According to my experiences, it is very slow for HBase to save data?
>>> Am I
>>> > right?
>>> >
>>> > For example, today I need to save data in a HashMap to HBase. It took
>>> > about more than three hours. However when saving the same HashMap in a
>>> file
>>> > in the text format with the redirected System.out, it took only 4.5
>>> seconds!
>>> >
>>> > Why is HBase so slow? It is indexing?
>>> >
>>> > My code to save data in HBase is as follows. I think the code must be
>>> > correct.
>>> >
>>> > ..
>>> > public synchronized void
>>> > AddVirtualOutgoingHHNeighbors(ConcurrentHashMap>> > ConcurrentHashMap>> hhOutNeighborMap, int
>>> timingScale)
>>> > {
>>> > List puts = new ArrayList();
>>> >
>>> > String hhNeighborRowKey;
>>> > Put hubKeyPut;
>>> > Put groupKeyPut;
>>> > Put topGroupKeyPut;
>>> > Put timingScalePut;
>>> > Put nodeKeyPut;
>>> > Put hubNeighborTypePut;
>>> >
>>> > for (Map.Entry>> > Set>> sourceHubGroupNeighborEntry :
>>> hhOutNeighborMap.entrySet())
>>> > {
>>> > for (Map.Entry>
>>> > groupNeighborEntry : sourceHubGroupNeighborEntry.getValue().entrySet())
>>> > {
>>> > for (String neighborKey :
>>> > groupNeighborEntry.getValue())
>>> > {
>>> > hhNeighborRowKey =
>>> > NeighborStructure.HUB_HUB_NEIGHBOR_ROW +
>>> > Tools.GetAHash(sourceHubGroupNeighborEntry.getKey() +
>>> > groupNeighborEntry.getKey() + timingScale + neighborKey);
>>> >
>>> > hubKeyPut = new
>>> > Put(Bytes.toBytes(hhNeighborRowKey));
>>> >
>>> > hubKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_HUB_KEY_COLUMN),
>>> > Bytes.toBytes(sourceHubGroupNeighborEntry.getKey()));
>>> > puts.add(hubKeyPut);
>>> >
>>> > groupKeyPut = new
>>> > Put(Bytes.toBytes(hhNeighborRowKey));
>>> >
>>> >
>>> groupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_GROUP_KEY_COLUMN),
>>> > Bytes.toBytes(groupNeighborEntry.getKey()));
>>> > puts.add(groupKeyPut);
>>> >
>>> > topGroupKeyPut = new
>>> > Put(Bytes.toBytes(hhNeighborRowKey));
>>> >
>>> >
>>> topGroupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TOP_GROUP_KEY_COLUMN),
>>> >
>>> Bytes.toBytes(GroupRegistry.WWW().GetParentGroupKey(groupNeighborEntry.getKey(;
>>> > puts.add(topGroupKeyPut);
>>> >
>>> > 

Re: HBase Is So Slow To Save Data?

2012-08-29 Thread Mohammad Tariq
I don't think 2G is sufficient, keeping in mind that all the hadoop
daemons are running on the same box (maybe your IDE and other stuff too).

On Wednesday, August 29, 2012, Mohammad Tariq  wrote:
> Pseudo-distributed setup could be a cause.
>
> On Wednesday, August 29, 2012, N Keywal  wrote:
>> Hi Bing,
>>
>> You should expect HBase to be slower in the generic case:
>> 1) it writes much more data (see hbase data model), with extra columns
>> qualifiers, timestamps & so on.
>> 2) the data is written multiple times: once in the write-ahead-log, once
>> per replica on datanode & so on again.
>> 3) there are inter process calls & inter machine calls on the critical
path.
>>
>> This is the cost of the atomicity, reliability and scalability features.
>> With these features in mind, HBase is reasonably fast to save data on a
>> cluster.
>>
>> On your specific case (without the points 2 & 3 above), the performance
>> seems to be very bad.
>>
>> You should first look at:
>> - how much is spent in the put vs. preparing the list
>> - do you have garbage collection going on? even swap?
>> - what's the size of your final Array vs. the available memory?
>>
>> Cheers,
>>
>> N.
>>
>>
>> On Wed, Aug 29, 2012 at 4:08 PM, Bing Li  wrote:
>>
>>> Dear all,
>>>
>>> By the way, my HBase is in the pseudo-distributed mode. Thanks!
>>>
>>> Best regards,
>>> Bing
>>>
>>> On Wed, Aug 29, 2012 at 10:04 PM, Bing Li  wrote:
>>>
>>> > Dear all,
>>> >
>>> > According to my experiences, it is very slow for HBase to save data?
Am I
>>> > right?
>>> >
>>> > For example, today I need to save data in a HashMap to HBase. It took
>>> > about more than three hours. However when saving the same HashMap in a
>>> file
>>> > in the text format with the redirected System.out, it took only 4.5
>>> seconds!
>>> >
>>> > Why is HBase so slow? It is indexing?
>>> >
>>> > My code to save data in HBase is as follows. I think the code must be
>>> > correct.
>>> >
>>> > ..
>>> > public synchronized void
>>> > AddVirtualOutgoingHHNeighbors(ConcurrentHashMap>> > ConcurrentHashMap>> hhOutNeighborMap, int
>>> timingScale)
>>> > {
>>> > List puts = new ArrayList();
>>> >
>>> > String hhNeighborRowKey;
>>> > Put hubKeyPut;
>>> > Put groupKeyPut;
>>> > Put topGroupKeyPut;
>>> > Put timingScalePut;
>>> > Put nodeKeyPut;
>>> > Put hubNeighborTypePut;
>>> >
>>> > for (Map.Entry>> > Set>> sourceHubGroupNeighborEntry :
hhOutNeighborMap.entrySet())
>>> > {
>>> > for (Map.Entry>
>>> > groupNeighborEntry :
sourceHubGroupNeighborEntry.getValue().entrySet())
>>> > {
>>> > for (String neighborKey :
>>> > groupNeighborEntry.getValue())
>>> > {
>>> > hhNeighborRowKey =
>>> > NeighborStructure.HUB_HUB_NEIGHBOR_ROW +
>>> > Tools.GetAHash(sourceHubGroupNeighborEntry.getKey() +
>>> > groupNeighborEntry.getKey() + timingScale + neighborKey);
>>> >
>>> > hubKeyPut = new
>>> > Put(Bytes.toBytes(hhNeighborRowKey));
>>> >
>>> >
hubKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_HUB_KEY_COLUMN),
>>> > Bytes.toBytes(sourceHubGroupNeighborEntry.getKey()));
>>> > puts.add(hubKeyPut);
>>> >
>>> > groupKeyPut = new
>>> > Put(Bytes.toBytes(hhNeighborRowKey));
>>> >
>>> >
groupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_GROUP_KEY_COLUMN),
>>> > Bytes.toBytes(groupNeig--
> Regards,
> Mohammad Tariq
>

-- 
Regards,
Mohammad Tariq


Re: fast way to do random getRowOrAfter reads

2012-08-29 Thread Ferdy Galema
Ran some tests and it seems that single-use Scanner requests are not that
bad after all. I guess the important part is to set row caching to 1 and
correctly close every scanner afterwards.
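
For reference, the pattern looks roughly like this (sketch; error handling
and table lifecycle left out):

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class GetRowOrAfter {
  // returns the first row with key >= the requested key, or null if none
  public static Result getRowOrAfter(HTable table, byte[] key) throws IOException {
    Scan scan = new Scan(key);   // start at the requested key (inclusive)
    scan.setCaching(1);          // we only want the first row back
    ResultScanner scanner = table.getScanner(scan);
    try {
      return scanner.next();
    } finally {
      scanner.close();           // always close the single-use scanner
    }
  }
}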

On Mon, Aug 27, 2012 at 4:33 PM, Ferdy Galema wrote:

> I want to do a lot of random reads, but I need to get the first row after
> the requested key. I know I can make a scanner every time (with a specified
> startrow) and close it after a single result is fetched, but this seems
> like a lot overhead.
>
> Something like HTable's getRowOrBefore method, but then getRowOrAfter.
> (Note that getRowOrBefore is deprecated).
>
> Any advice?
>


Re: HBase Is So Slow To Save Data?

2012-08-29 Thread Mohammad Tariq
Pseudo-distributed setup could be a cause.

On Wednesday, August 29, 2012, N Keywal  wrote:
> Hi Bing,
>
> You should expect HBase to be slower in the generic case:
> 1) it writes much more data (see hbase data model), with extra columns
> qualifiers, timestamps & so on.
> 2) the data is written multiple times: once in the write-ahead-log, once
> per replica on datanode & so on again.
> 3) there are inter process calls & inter machine calls on the critical
path.
>
> This is the cost of the atomicity, reliability and scalability features.
> With these features in mind, HBase is reasonably fast to save data on a
> cluster.
>
> On your specific case (without the points 2 & 3 above), the performance
> seems to be very bad.
>
> You should first look at:
> - how much is spent in the put vs. preparing the list
> - do you have garbage collection going on? even swap?
> - what's the size of your final Array vs. the available memory?
>
> Cheers,
>
> N.
>
>
> On Wed, Aug 29, 2012 at 4:08 PM, Bing Li  wrote:
>
>> Dear all,
>>
>> By the way, my HBase is in the pseudo-distributed mode. Thanks!
>>
>> Best regards,
>> Bing
>>
>> On Wed, Aug 29, 2012 at 10:04 PM, Bing Li  wrote:
>>
>> > Dear all,
>> >
>> > According to my experiences, it is very slow for HBase to save data?
Am I
>> > right?
>> >
>> > For example, today I need to save data in a HashMap to HBase. It took
>> > about more than three hours. However when saving the same HashMap in a
>> file
>> > in the text format with the redirected System.out, it took only 4.5
>> seconds!
>> >
>> > Why is HBase so slow? It is indexing?
>> >
>> > My code to save data in HBase is as follows. I think the code must be
>> > correct.
>> >
>> > ..
>> > public synchronized void
>> > AddVirtualOutgoingHHNeighbors(ConcurrentHashMap> > ConcurrentHashMap>> hhOutNeighborMap, int
>> timingScale)
>> > {
>> > List puts = new ArrayList();
>> >
>> > String hhNeighborRowKey;
>> > Put hubKeyPut;
>> > Put groupKeyPut;
>> > Put topGroupKeyPut;
>> > Put timingScalePut;
>> > Put nodeKeyPut;
>> > Put hubNeighborTypePut;
>> >
>> > for (Map.Entry> > Set>> sourceHubGroupNeighborEntry :
hhOutNeighborMap.entrySet())
>> > {
>> > for (Map.Entry>
>> > groupNeighborEntry : sourceHubGroupNeighborEntry.getValue().entrySet())
>> > {
>> > for (String neighborKey :
>> > groupNeighborEntry.getValue())
>> > {
>> > hhNeighborRowKey =
>> > NeighborStructure.HUB_HUB_NEIGHBOR_ROW +
>> > Tools.GetAHash(sourceHubGroupNeighborEntry.getKey() +
>> > groupNeighborEntry.getKey() + timingScale + neighborKey);
>> >
>> > hubKeyPut = new
>> > Put(Bytes.toBytes(hhNeighborRowKey));
>> >
>> > hubKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_HUB_KEY_COLUMN),
>> > Bytes.toBytes(sourceHubGroupNeighborEntry.getKey()));
>> > puts.add(hubKeyPut);
>> >
>> > groupKeyPut = new
>> > Put(Bytes.toBytes(hhNeighborRowKey));
>> >
>> >
groupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_GROUP_KEY_COLUMN),
>> > Bytes.toBytes(groupNeighborEntry.getKey()));
>> >

-- 
Regards,
Mohammad Tariq


Re: HBase Is So Slow To Save Data?

2012-08-29 Thread Bing Li
Dear N Keywal,

Thanks so much for your reply!

The total amount of data is about 110M. The available memory is enough: 2G.

In Java, I just set a collection to null so it can be garbage collected. Do
you think that is fine?

Best regards,
Bing

On Wed, Aug 29, 2012 at 11:22 PM, N Keywal  wrote:

> Hi Bing,
>
> You should expect HBase to be slower in the generic case:
> 1) it writes much more data (see hbase data model), with extra columns
> qualifiers, timestamps & so on.
> 2) the data is written multiple times: once in the write-ahead-log, once
> per replica on datanode & so on again.
> 3) there are inter process calls & inter machine calls on the critical
> path.
>
> This is the cost of the atomicity, reliability and scalability features.
> With these features in mind, HBase is reasonably fast to save data on a
> cluster.
>
> On your specific case (without the points 2 & 3 above), the performance
> seems to be very bad.
>
> You should first look at:
> - how much is spent in the put vs. preparing the list
> - do you have garbage collection going on? even swap?
> - what's the size of your final Array vs. the available memory?
>
> Cheers,
>
> N.
>
>
>
> On Wed, Aug 29, 2012 at 4:08 PM, Bing Li  wrote:
>
>> Dear all,
>>
>> By the way, my HBase is in the pseudo-distributed mode. Thanks!
>>
>> Best regards,
>> Bing
>>
>> On Wed, Aug 29, 2012 at 10:04 PM, Bing Li  wrote:
>>
>> > Dear all,
>> >
>> > According to my experiences, it is very slow for HBase to save data? Am
>> I
>> > right?
>> >
>> > For example, today I need to save data in a HashMap to HBase. It took
>> > about more than three hours. However when saving the same HashMap in a
>> file
>> > in the text format with the redirected System.out, it took only 4.5
>> seconds!
>> >
>> > Why is HBase so slow? It is indexing?
>> >
>> > My code to save data in HBase is as follows. I think the code must be
>> > correct.
>> >
>> > ..
>> > public synchronized void
>> > AddVirtualOutgoingHHNeighbors(ConcurrentHashMap> > ConcurrentHashMap>> hhOutNeighborMap, int
>> timingScale)
>> > {
>> > List puts = new ArrayList();
>> >
>> > String hhNeighborRowKey;
>> > Put hubKeyPut;
>> > Put groupKeyPut;
>> > Put topGroupKeyPut;
>> > Put timingScalePut;
>> > Put nodeKeyPut;
>> > Put hubNeighborTypePut;
>> >
>> > for (Map.Entry> > Set>> sourceHubGroupNeighborEntry : hhOutNeighborMap.entrySet())
>> > {
>> > for (Map.Entry>
>> > groupNeighborEntry : sourceHubGroupNeighborEntry.getValue().entrySet())
>> > {
>> > for (String neighborKey :
>> > groupNeighborEntry.getValue())
>> > {
>> > hhNeighborRowKey =
>> > NeighborStructure.HUB_HUB_NEIGHBOR_ROW +
>> > Tools.GetAHash(sourceHubGroupNeighborEntry.getKey() +
>> > groupNeighborEntry.getKey() + timingScale + neighborKey);
>> >
>> > hubKeyPut = new
>> > Put(Bytes.toBytes(hhNeighborRowKey));
>> >
>> > hubKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_HUB_KEY_COLUMN),
>> > Bytes.toBytes(sourceHubGroupNeighborEntry.getKey()));
>> > puts.add(hubKeyPut);
>> >
>> > groupKeyPut = new
>> > Put(Bytes.toBytes(hhNeighborRowKey));
>> >
>> >
>> groupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_GROUP_KEY_COLUMN),
>> > Bytes.toBytes(groupNeighborEntry.getKey()));
>> > puts.add(groupKeyPut);
>> >
>> > topGroupKeyPut = new
>> > Put(Bytes.toBytes(hhNeighborRowKey));
>> >
>> >
>> topGroupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TOP_GROUP_KEY_COLUMN),
>> >
>> Bytes.toBytes(GroupRegistry.WWW().GetParentGroupKey(groupNeighborEntry.getKey(;
>> > puts.add(topGroupKeyPut);
>> >
>> > timingScalePut = new
>> > Put(Bytes.toBytes(hhNeighborRowKey));
>> >
>> >
>> timingScalePut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TIMING_SCALE_COLUMN),
>> > Bytes.toBytes(timingScale));
>> > puts.add(timingScalePut);
>> >
>> > nodeKeyPut = new
>> > Put(Bytes.toBytes(hhNeighborRowKey));
>> >
>> > nodeKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
>> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_NODE_KEY_COLUMN),
>> > Bytes.toBytes(neighborKey));
>> > 

Re: HBase Is So Slow To Save Data?

2012-08-29 Thread N Keywal
Hi Bing,

You should expect HBase to be slower in the generic case:
1) it writes much more data (see hbase data model), with extra columns
qualifiers, timestamps & so on.
2) the data is written multiple times: once in the write-ahead-log, once
per replica on datanode & so on again.
3) there are inter process calls & inter machine calls on the critical path.

This is the cost of the atomicity, reliability and scalability features.
With these features in mind, HBase is reasonably fast to save data on a
cluster.

In your specific case (where points 2 & 3 above largely do not apply), the
performance seems to be very bad.

You should first look at:
- how much is spent in the put vs. preparing the list
- do you have garbage collection going on? even swap?
- what's the size of your final Array vs. the available memory?

Cheers,

N.
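
(To make the first point in the checklist above concrete, here is a small
timing sketch that separates building the Put list from the table.put() call.
It is illustrative only: the table, family and qualifier names are placeholders
and the calls are the plain 0.9x HTable client API, not code from this thread.)

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutTimingSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "test_table");   // placeholder table name

            long t0 = System.nanoTime();
            List<Put> puts = new ArrayList<Put>();
            for (int i = 0; i < 100000; i++) {
                Put p = new Put(Bytes.toBytes("row-" + i));
                p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
                puts.add(p);
            }
            long t1 = System.nanoTime();

            table.put(puts);        // the batched client call
            table.flushCommits();   // drain anything left in the client buffer
            long t2 = System.nanoTime();

            System.out.println("prepare: " + (t1 - t0) / 1000000 + " ms, put: "
                    + (t2 - t1) / 1000000 + " ms");
            table.close();
        }
    }

If most of the time turns out to be in table.put(), the next things to look at
are the GC and swap questions from the list above.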


On Wed, Aug 29, 2012 at 4:08 PM, Bing Li  wrote:

> Dear all,
>
> By the way, my HBase is in the pseudo-distributed mode. Thanks!
>
> Best regards,
> Bing
>
> On Wed, Aug 29, 2012 at 10:04 PM, Bing Li  wrote:
>
> > Dear all,
> >
> > According to my experiences, it is very slow for HBase to save data? Am I
> > right?
> >
> > For example, today I need to save data in a HashMap to HBase. It took
> > about more than three hours. However when saving the same HashMap in a
> file
> > in the text format with the redirected System.out, it took only 4.5
> seconds!
> >
> > Why is HBase so slow? It is indexing?
> >
> > My code to save data in HBase is as follows. I think the code must be
> > correct.
> >
> > ..
> > public synchronized void
> > AddVirtualOutgoingHHNeighbors(ConcurrentHashMap > ConcurrentHashMap>> hhOutNeighborMap, int
> timingScale)
> > {
> > List puts = new ArrayList();
> >
> > String hhNeighborRowKey;
> > Put hubKeyPut;
> > Put groupKeyPut;
> > Put topGroupKeyPut;
> > Put timingScalePut;
> > Put nodeKeyPut;
> > Put hubNeighborTypePut;
> >
> > for (Map.Entry > Set>> sourceHubGroupNeighborEntry : hhOutNeighborMap.entrySet())
> > {
> > for (Map.Entry>
> > groupNeighborEntry : sourceHubGroupNeighborEntry.getValue().entrySet())
> > {
> > for (String neighborKey :
> > groupNeighborEntry.getValue())
> > {
> > hhNeighborRowKey =
> > NeighborStructure.HUB_HUB_NEIGHBOR_ROW +
> > Tools.GetAHash(sourceHubGroupNeighborEntry.getKey() +
> > groupNeighborEntry.getKey() + timingScale + neighborKey);
> >
> > hubKeyPut = new
> > Put(Bytes.toBytes(hhNeighborRowKey));
> >
> > hubKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_HUB_KEY_COLUMN),
> > Bytes.toBytes(sourceHubGroupNeighborEntry.getKey()));
> > puts.add(hubKeyPut);
> >
> > groupKeyPut = new
> > Put(Bytes.toBytes(hhNeighborRowKey));
> >
> > groupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_GROUP_KEY_COLUMN),
> > Bytes.toBytes(groupNeighborEntry.getKey()));
> > puts.add(groupKeyPut);
> >
> > topGroupKeyPut = new
> > Put(Bytes.toBytes(hhNeighborRowKey));
> >
> >
> topGroupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TOP_GROUP_KEY_COLUMN),
> >
> Bytes.toBytes(GroupRegistry.WWW().GetParentGroupKey(groupNeighborEntry.getKey(;
> > puts.add(topGroupKeyPut);
> >
> > timingScalePut = new
> > Put(Bytes.toBytes(hhNeighborRowKey));
> >
> >
> timingScalePut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TIMING_SCALE_COLUMN),
> > Bytes.toBytes(timingScale));
> > puts.add(timingScalePut);
> >
> > nodeKeyPut = new
> > Put(Bytes.toBytes(hhNeighborRowKey));
> >
> > nodeKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_NODE_KEY_COLUMN),
> > Bytes.toBytes(neighborKey));
> > puts.add(nodeKeyPut);
> >
> > hubNeighborTypePut = new
> > Put(Bytes.toBytes(hhNeighborRowKey));
> >
> >
> hubNeighborTypePut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> > Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TYPE_COLUMN),
> > Bytes.toBytes(SocialRole.VIRTUAL_NEIGHBOR));
> > puts.add(hubNeighborTypePut);

bulk loading - region creation/pre-spliting

2012-08-29 Thread Oleg Ruchovets
Hi,
I have a bulk loading job.
My job is for user data aggregation.
Before I run the bulk loading aggregation I want to create the regions.
The UserIDs look like this:

943e2c6d66d732e06ab257903f240d27
a0617cb2b964690a39b0d93e7fe2f021
ac85b4dee6d8c8495d61201234dfb73e
b8416d5e0fe2a1228f042dffa8d291e2
c422be9e75d28d9afe0f1f98f59cda92
fe6b0ad1822455958586e240eb75c1d7
1790ee2ce4487d976cd9eddd036275d6
344c3de9449a9522d2a4de8bb9e81b02
4fcccd6790aec3056f897741b467d08c
6b67dc1922e4fc0cd6fa31f64bd51ef3
87f1374e7c900a243450f5b5c3a2b2b9
a4180db6a62f300cdecf77310f0010ac

I have ~50,000,000 users. I run the aggregation on a daily basis, and per day I
have ~30 regions.
So the objective is to create 30 regions with a more or less equal
distribution.

The question is: what is the best practice for determining the start/end keys
for the regions in my use case?

Thanks in advance
Oleg.
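
(For keys like these, 32-character hex strings from what looks like a uniform
hash, one straightforward approach is to divide the hex keyspace evenly and pass
the resulting split points to createTable. This is only a sketch, not necessarily
the best practice asked about; the table name "uu_daily" and family "d" are
chosen purely for illustration.)

    import java.math.BigInteger;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitSketch {
        public static void main(String[] args) throws Exception {
            int numRegions = 30;
            // Keys are 32-char lower-case hex strings, so the keyspace is [0, 16^32).
            BigInteger max = BigInteger.valueOf(16).pow(32);
            byte[][] splits = new byte[numRegions - 1][];
            for (int i = 1; i < numRegions; i++) {
                BigInteger point = max.multiply(BigInteger.valueOf(i))
                        .divide(BigInteger.valueOf(numRegions));
                // Left-pad to 32 hex chars so the split keys sort like the row keys.
                splits[i - 1] = Bytes.toBytes(String.format("%032x", point));
            }

            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = new HTableDescriptor("uu_daily"); // illustration only
            desc.addFamily(new HColumnDescriptor("d"));               // illustration only
            admin.createTable(desc, splits);   // 29 split points -> 30 regions
        }
    }

Because the keys are uniformly distributed hashes, evenly spaced hex boundaries
should give a roughly even row distribution per region.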


Re: bulk loading problem

2012-08-29 Thread Oleg Ruchovets
Great.
It works 

On Tue, Aug 28, 2012 at 6:42 PM, Igal Shilman  wrote:

> As suggested by the book, take a look at:
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles class,
>
> This tool expects two arguments: (1) the path to the generated HFiles (in
> your case it's outputPath) (2) the target table.
> To use it programatically, you can either invoke it via the ToolRunner, or
> calling LoadIncrementalHFiles.doBulkLoad() by yourself.
> (after your M/R job has successfully finished)
>
> If you are already loading to an existing table, then: (following your
> code)
>
> LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
> > int ret = loader.doBulkLoad(new Path(outputPath), new HTable(conf,
> > tableName));
>
>
> Otherwise,
>
>
> > int ret = ToolRunner.run(new LoadIncrementalHFiles(conf),
> > new String[] {outputPath, tableName});
>
>
>
> Good luck,
> Igal.
>
> On Tue, Aug 28, 2012 at 10:59 PM, Oleg Ruchovets  >wrote:
>
> > Hi Igal, thank you for the quick response.
> > Can I execute this step programmatically?
> >
> > From link you sent :
> >
> > 9.8.5. Advanced Usage
> >
> > Although the importtsv tool is useful in many cases, advanced users may
> > want to generate data programmatically, or import data from other formats.
> > To get started doing so, dig into ImportTsv.java and check the JavaDoc
> for
> > HFileOutputFormat.
> >
> > The import step of the bulk load can also be done programmatically. See
> the
> > LoadIncrementalHFiles class for more information.
> > The question is: what should I do/add to my job to write the generated
> > HFiles
> > programmatically to HBase?
> >
> >
> >
> >
> > On Tue, Aug 28, 2012 at 8:08 PM, Igal Shilman  wrote:
> >
> > > Hi,
> > > You need to complete the bulk load.
> > > Check out http://hbase.apache.org/book/arch.bulk.load.html 9.8.2
> > >
> > > Igal.
> > >
> > > On Tue, Aug 28, 2012 at 7:29 PM, Oleg Ruchovets  > > >wrote:
> > >
> > > > Hi,
> > > > I am in the process of writing my first bulk loading job. I use Cloudera
> > > > CDH3U3 with HBase 0.90.4.
> > > >
> > > > Executing the job, I see the HFiles which were created after the job
> > > > finished, but there were no entries in HBase. In the hbase shell,
> > > > count 'uu_bulk' returns 0.
> > > >
> > > > Here is my job configuration:
> > > >
> > > > Configuration  conf =  HBaseConfiguration.create();
> > > >
> > > >Job job = new Job(conf, getClass().getSimpleName());
> > > >
> > > > job.setJarByClass(UuPushMapReduceJobFactory.class);
> > > > job.setMapperClass(UuPushMapper.class);
> > > > job.setMapOutputKeyClass(ImmutableBytesWritable.class);
> > > > job.setMapOutputValueClass(KeyValue.class);
> > > > job.setOutputFormatClass(HFileOutputFormat.class);
> > > >
> > > >
> > > >
> > > > String path = uuAggregationContext.getUuInputPath();
> > > > String outputPath =
> > > > "/bulk_loading_hbase/output/"+System.currentTimeMillis();
> > > > LOG.info("path = " + path);
> > > > LOG.info("outputPath = " + outputPath);
> > > >
> > > > final String tableName = "uu_bulk";
> > > > LOG.info("hbase tableName: " + tableName);
> > > > createRegions(conf , Bytes.toBytes(tableName));
> > > >
> > > > FileInputFormat.addInputPath(job, new Path(path));
> > > > FileOutputFormat.setOutputPath(job, new Path(outputPath));
> > > >
> > > > HFileOutputFormat.configureIncrementalLoad(job, new
> > HTable(conf,
> > > > tableName));
> > > >
> > > >
> > >
> >
> //=
> > > > Reducers log ends
> > > >
> > > > 2012-08-28 11:53:40,643 INFO org.apache.hadoop.mapred.Merger: Down to
> > > > the last merge-pass, with 10 segments left of total size: 222885367
> > > > bytes
> > > > 2012-08-28 11:53:54,137 INFO
> > > > org.apache.hadoop.hbase.mapreduce.HFileOutputFormat:
> > > >
> > > >
> > >
> >
> Writer=hdfs://hdn16/bulk_loading_hbase/output/1346194117045/_temporary/_attempt_201208260949_0026_r_05_0/d/3908303205246218823,
> > > > wrote=268435455
> > > > 2012-08-28 11:54:11,966 INFO org.apache.hadoop.mapred.Task:
> > > > Task:attempt_201208260949_0026_r_05_0 is done. And is in the
> > > > process of commiting
> > > > 2012-08-28 11:54:12,975 INFO org.apache.hadoop.mapred.Task: Task
> > > > attempt_201208260949_0026_r_05_0 is allowed to commit now
> > > > 2012-08-28 11:54:13,007 INFO
> > > > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved
> > > > output of task 'attempt_201208260949_0026_r_05_0' to
> > > > /bulk_loading_hbase/output/1346194117045
> > > > 2012-08-28 11:54:13,009 INFO org.apache.hadoop.mapred.Task: Task
> > > > 'attempt_201208260949_0026_r_05_0' done.
> > > > 2012-08-28 11:54:13,010 INFO
> > > > org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs'
> > > > truncater with mapRetainSize=-1 and reduceRetainSize=-1
> > > >
> > > > As I understand HFiles were written
> > > > 
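
(For reference, the piece missing from the job configuration above is the load
step that Igal's snippet adds once the M/R job has finished. A minimal sketch
tying the two together; it reuses conf, job, outputPath and tableName from the
code above, and exception handling is omitted.)

    // Sketch: run the M/R job, then hand the generated HFiles to the table.
    boolean ok = job.waitForCompletion(true);
    if (ok) {
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path(outputPath), new HTable(conf, tableName));
    }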

Re: HBase Is So Slow To Save Data?

2012-08-29 Thread Bing Li
Dear all,

By the way, my HBase is in the pseudo-distributed mode. Thanks!

Best regards,
Bing

On Wed, Aug 29, 2012 at 10:04 PM, Bing Li  wrote:

> Dear all,
>
> According to my experiences, it is very slow for HBase to save data? Am I
> right?
>
> For example, today I need to save data in a HashMap to HBase. It took
> about more than three hours. However when saving the same HashMap in a file
> in the text format with the redirected System.out, it took only 4.5 seconds!
>
> Why is HBase so slow? It is indexing?
>
> My code to save data in HBase is as follows. I think the code must be
> correct.
>
> ..
> public synchronized void
> AddVirtualOutgoingHHNeighbors(ConcurrentHashMap ConcurrentHashMap>> hhOutNeighborMap, int timingScale)
> {
> List puts = new ArrayList();
>
> String hhNeighborRowKey;
> Put hubKeyPut;
> Put groupKeyPut;
> Put topGroupKeyPut;
> Put timingScalePut;
> Put nodeKeyPut;
> Put hubNeighborTypePut;
>
> for (Map.Entry Set>> sourceHubGroupNeighborEntry : hhOutNeighborMap.entrySet())
> {
> for (Map.Entry>
> groupNeighborEntry : sourceHubGroupNeighborEntry.getValue().entrySet())
> {
> for (String neighborKey :
> groupNeighborEntry.getValue())
> {
> hhNeighborRowKey =
> NeighborStructure.HUB_HUB_NEIGHBOR_ROW +
> Tools.GetAHash(sourceHubGroupNeighborEntry.getKey() +
> groupNeighborEntry.getKey() + timingScale + neighborKey);
>
> hubKeyPut = new
> Put(Bytes.toBytes(hhNeighborRowKey));
>
> hubKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_HUB_KEY_COLUMN),
> Bytes.toBytes(sourceHubGroupNeighborEntry.getKey()));
> puts.add(hubKeyPut);
>
> groupKeyPut = new
> Put(Bytes.toBytes(hhNeighborRowKey));
>
> groupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_GROUP_KEY_COLUMN),
> Bytes.toBytes(groupNeighborEntry.getKey()));
> puts.add(groupKeyPut);
>
> topGroupKeyPut = new
> Put(Bytes.toBytes(hhNeighborRowKey));
>
> topGroupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TOP_GROUP_KEY_COLUMN),
> Bytes.toBytes(GroupRegistry.WWW().GetParentGroupKey(groupNeighborEntry.getKey(;
> puts.add(topGroupKeyPut);
>
> timingScalePut = new
> Put(Bytes.toBytes(hhNeighborRowKey));
>
> timingScalePut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TIMING_SCALE_COLUMN),
> Bytes.toBytes(timingScale));
> puts.add(timingScalePut);
>
> nodeKeyPut = new
> Put(Bytes.toBytes(hhNeighborRowKey));
>
> nodeKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_NODE_KEY_COLUMN),
> Bytes.toBytes(neighborKey));
> puts.add(nodeKeyPut);
>
> hubNeighborTypePut = new
> Put(Bytes.toBytes(hhNeighborRowKey));
>
> hubNeighborTypePut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
> Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TYPE_COLUMN),
> Bytes.toBytes(SocialRole.VIRTUAL_NEIGHBOR));
> puts.add(hubNeighborTypePut);
> }
> }
> }
>
> try
> {
> this.neighborTable.put(puts);
> }
> catch (IOException e)
> {
> e.printStackTrace();
> }
> }
> ..
>
> Thanks so much!
>
> Best regards,
> Bing
>


HBase Is So Slow To Save Data?

2012-08-29 Thread Bing Li
Dear all,

According to my experience, HBase seems very slow at saving data. Am I
right?

For example, today I needed to save the data in a HashMap to HBase. It took
more than three hours. However, when saving the same HashMap to a file in
text format with redirected System.out, it took only 4.5 seconds!

Why is HBase so slow? Is it the indexing?

My code to save the data in HBase is as follows. I think the code is
correct.

..
public synchronized void AddVirtualOutgoingHHNeighbors(ConcurrentHashMap<String,
    ConcurrentHashMap<String, Set<String>>> hhOutNeighborMap, int timingScale)
{
    List<Put> puts = new ArrayList<Put>();

    String hhNeighborRowKey;
    Put hubKeyPut;
    Put groupKeyPut;
    Put topGroupKeyPut;
    Put timingScalePut;
    Put nodeKeyPut;
    Put hubNeighborTypePut;

    for (Map.Entry<String, ConcurrentHashMap<String, Set<String>>>
        sourceHubGroupNeighborEntry : hhOutNeighborMap.entrySet())
    {
        for (Map.Entry<String, Set<String>> groupNeighborEntry :
            sourceHubGroupNeighborEntry.getValue().entrySet())
        {
            for (String neighborKey : groupNeighborEntry.getValue())
            {
                hhNeighborRowKey = NeighborStructure.HUB_HUB_NEIGHBOR_ROW +
                    Tools.GetAHash(sourceHubGroupNeighborEntry.getKey() +
                    groupNeighborEntry.getKey() + timingScale + neighborKey);

                hubKeyPut = new Put(Bytes.toBytes(hhNeighborRowKey));
                hubKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
                    Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_HUB_KEY_COLUMN),
                    Bytes.toBytes(sourceHubGroupNeighborEntry.getKey()));
                puts.add(hubKeyPut);

                groupKeyPut = new Put(Bytes.toBytes(hhNeighborRowKey));
                groupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
                    Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_GROUP_KEY_COLUMN),
                    Bytes.toBytes(groupNeighborEntry.getKey()));
                puts.add(groupKeyPut);

                topGroupKeyPut = new Put(Bytes.toBytes(hhNeighborRowKey));
                topGroupKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
                    Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TOP_GROUP_KEY_COLUMN),
                    Bytes.toBytes(GroupRegistry.WWW().GetParentGroupKey(groupNeighborEntry.getKey())));
                puts.add(topGroupKeyPut);

                timingScalePut = new Put(Bytes.toBytes(hhNeighborRowKey));
                timingScalePut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
                    Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TIMING_SCALE_COLUMN),
                    Bytes.toBytes(timingScale));
                puts.add(timingScalePut);

                nodeKeyPut = new Put(Bytes.toBytes(hhNeighborRowKey));
                nodeKeyPut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
                    Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_NODE_KEY_COLUMN),
                    Bytes.toBytes(neighborKey));
                puts.add(nodeKeyPut);

                hubNeighborTypePut = new Put(Bytes.toBytes(hhNeighborRowKey));
                hubNeighborTypePut.add(Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_FAMILY),
                    Bytes.toBytes(NeighborStructure.HUB_HUB_NEIGHBOR_TYPE_COLUMN),
                    Bytes.toBytes(SocialRole.VIRTUAL_NEIGHBOR));
                puts.add(hubNeighborTypePut);
            }
        }
    }

    try
    {
        this.neighborTable.put(puts);
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
}
..

Thanks so much!

Best regards,
Bing
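
(One client-side detail worth checking with code like the above; it is not
discussed in this thread, so treat it as a general hint rather than the answer:
with auto-flush disabled, the HTable client sends buffered Puts in chunks of
roughly the configured write-buffer size instead of as one huge batch. A sketch;
the table name and buffer size are placeholders, and exception handling is
omitted.)

    HTable table = new HTable(conf, "neighbor_table");   // placeholder table name
    table.setAutoFlush(false);                   // let the client buffer puts
    table.setWriteBufferSize(8 * 1024 * 1024);   // flush in ~8 MB chunks
    table.put(puts);                             // the List<Put> built above
    table.flushCommits();                        // push whatever is still buffered
    table.close();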


Re: setTimeRange and setMaxVersions seem to be inefficient

2012-08-29 Thread Jerry Lam
Hi Lars:

Thanks for spending time discussing this with me. I appreciate it.

I tried to implement the setMaxVersions(1) inside the filter as follows:

@Override
public ReturnCode filterKeyValue(KeyValue kv) {
    // Check whether this is the same qualifier as the one that has been
    // included previously. If yes, jump to the next column.
    if (previousIncludedQualifier != null &&
            Bytes.compareTo(previousIncludedQualifier, kv.getQualifier()) == 0) {
        previousIncludedQualifier = null;
        return ReturnCode.NEXT_COL;
    }
    // Another condition that makes a further jump using a seek hint.
    if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
        LOG.info("Match found.");
        return ReturnCode.SEEK_NEXT_USING_HINT;
    }
    // Include this KeyValue in the result and keep track of the included
    // qualifier so the next version of the same qualifier will be excluded.
    previousIncludedQualifier = kv.getQualifier();
    return ReturnCode.INCLUDE;
}

Does this look reasonable, or is there a better way to achieve this? It
would be nice to have a ReturnCode.INCLUDE_AND_NEXT_COL for this case, though.

Best Regards,

Jerry


On Wed, Aug 29, 2012 at 2:09 AM, lars hofhansl  wrote:

> Hi Jerry,
>
> my answer will be the same again:
> Some folks will want the max versions set by the client to be before
> filters and some folks will want it to restrict the end result.
> It's not possible to have it both ways. Your filter needs to do the right
> thing.
>
>
> There's a lot of discussion around this in HBASE-5104.
>
>
> -- Lars
>
>
>
> 
>  From: Jerry Lam 
> To: user@hbase.apache.org; lars hofhansl 
> Sent: Tuesday, August 28, 2012 1:52 PM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> I see. Please refer to the inline comment below.
>
> Best Regards,
>
> Jerry
>
> On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl 
> wrote:
>
> > What I was saying was: It depends. :)
> >
> > First off, how do you get to 1000 versions? In 0.94++ older version are
> > pruned upon flush, so you need 333 flushes (assuming 3 versions on the
> CF)
> > to get 1000 versions.
> >
>
> I forgot that the default number of version to keep is 3. If this is what
> people use most of the time, yes you are right for this type of scenarios
> where the number of version per column to keep is small.
>
> By that time some compactions will have happened and you're back to close
> > to 3 versions (maybe 9, 12, or 15 or so, depending on how store files you
> > have).
> >
> > Now, if you have that many version because because you set VERSIONS=>1000
> > in your CF... Then imagine you have 100 columns with 1000 versions each.
> >
>
> Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
> versioning myself)
>
> In your scenario below you'd do 10 comparisons if the filter would be
> > evaluated after the version counting. But only 1100 with the current
> code.
> > (or at least in that ball park)
> >
>
> This is where I don't quite understand what you mean.
>
> if the framework counts the number of ReturnCode.INCLUDE and then stops
> feeding the KeyValue into the filterKeyValue method after it reaches the
> count specified in setMaxVersions (i.e. 1 for the case we discussed),
> should then be just 100 comparisons only (at most) instead of 1100
> comparisons? Maybe I don't understand how the current way is doing...
>
>
>
> >
> > The gist is: One can construct scenarios where one approach is better
> than
> > the other. Only one order is possible.
> > If you write a custom filter and you care about these things you should
> > use the seek hints.
> >
> > -- Lars
> >
> >
> > - Original Message -
> > From: Jerry Lam 
> > To: user@hbase.apache.org; lars hofhansl 
> > Cc:
> > Sent: Tuesday, August 28, 2012 7:17 AM
> > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi Lars:
> >
> > Thanks for the reply.
> > I need to understand if I misunderstood the perceived inefficiency
> because
> > it seems you don't think quite the same.
> >
> > Let say, as an example, we have 1 row with 2 columns (col-1 and col-2)
> in a
> > table and each column has 1000 versions. Using the following code (the
> code
> > might have errors and don't compile):
> > /**
> > * This is very simple use case of a ColumnPrefixFilter.
> > * In fact all other filters that make use of filterKeyValue will see
> > similar
> > * performance problems that I have concerned with when the number of
> > * versions per column could be huge.
> >
> > Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> > Scan scan = new Scan();
> > scan.setFilter(filter);
> > ResultScanner scanner = table.getScanner(scan);
> > for (Result result : scanner) {
> > for (KeyValue kv : result.raw()) {
> > System.out.println("KV: " + kv + ", Value: " +
> > Bytes.toString(kv.getValue()));
> > }
> > }
> > scanner.close();
> > */
> >
> > Implicitly, the number of version per column that is going to return is 1
> > (the latest version). User mig

Re: Timeseries data

2012-08-29 Thread Christian Schäfer
Like Mohit suggests, I would also create rows that contain all events for a certain
millisecond or second (as nested entities).

Due to this time-based grouping/aggregation/batching (aka timeboxing), each row
is like an event bag for all events that occurred in a certain millisecond.

Btw: grouping the puts on a millisecond or second basis (or, better, a bit more)
would decrease the pressure on HBase because of fewer RPC requests.

Kind regards,
Chris
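
(A minimal sketch of such a time-boxed layout is below: one row per
metric-and-second bucket, one column qualifier per event so that several events
in the same millisecond do not collide, and all puts for the bucket sent in one
batched call. The table layout, the family name "ev" and the per-client
sequence counter are assumptions for illustration, not something prescribed in
this thread.)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeboxedWriter {
        private final HTable table;
        private long sequence = 0;   // per-client tie-breaker for same-millisecond events

        public TimeboxedWriter(HTable table) { this.table = table; }

        public void write(String metricId, List<byte[]> payloads) throws IOException {
            long now = System.currentTimeMillis();
            long bucket = now / 1000L;                       // time-box: one row per second
            byte[] rowKey = Bytes.toBytes(metricId + ":" + bucket);

            List<Put> batch = new ArrayList<Put>();
            for (byte[] payload : payloads) {
                Put put = new Put(rowKey);
                // Qualifier = millisecond offset in the bucket plus a sequence
                // number, so two events in the same millisecond stay distinct.
                String qualifier = (now - bucket * 1000L) + ":" + (sequence++);
                put.add(Bytes.toBytes("ev"), Bytes.toBytes(qualifier), payload);
                batch.add(put);
            }
            table.put(batch);   // one batched call instead of one RPC per event
        }
    }

Reads for a time range then become a scan over the row keys of the covered
buckets.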


- Original Message -
From: Mohit Anchlia 
To: user@hbase.apache.org
CC: 
Sent: 2:54 Wednesday, 29 August 2012
Subject: Re: Timeseries data

How does it deal with multiple writes in the same milliseconds for the same
rowkey/column? I can't see that info.

On Tue, Aug 28, 2012 at 5:33 PM, Marcos Ortiz  wrote:

> Study the OpenTSDB at StumbleUpon described  by Benoit "tsuna" Sigoure (
> ts...@stumbleupon.com) in the
> HBaseCon talk called "Lessons Learned from OpenTSDB".
> His team have done a great job working with Time-series data, and he gave
> a lot of great advices to work with this kind of data with HBase:
> - Wider rows to seek faster
> - Use asynchbase + Netty or Finagle(great tool created by Twitter
> engineers to work with HBase) = performance ++
> - Make writes idempotent and independent
>    before: start rows at arbitrary points in time
>    after: align rows on 10m (then 1h) boundaries
> - Store more data per Key/Value
> - Compact your data
> - Use short family names
> Best wishes
> El 28/08/2012 20:21, Mohit Anchlia escribió:
>
>> In timeseries type data how do people deal with scenarios where one might
>> get multiple events in a millisecond? Using nano second approach seems
>> tricky. Other option is to take advantage of versions or counters.
>>
>>
>
>
>
>