Re: Doubt in HBase

2009-08-21 Thread Ryan Rawson
Well, the inputs to those reducers would be the empty set; they
wouldn't have anything to do, and their output would be empty as well.

If you are doing something like this, and your operation is
commutative and associative, consider using a combiner so that you don't
shuffle as much data. A large amount of shuffled data can make map-reduces
slower. While map-reduce is a sorter, shuffling 1500 GB just takes a
little while, you know?

You can also set the # of reducers, but the mapping of reduce
keys to reducer instances is random/hashed, iirc. The normative case,
however, is to have a large number of reduce keys rather than only a
small number.

Generally speaking, use the combiner functionality. It keeps the data
sizes low. A high reduce count is better when you have to shuffle
a lot of data with many distinct reduce keys.
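
For illustration, a minimal, untested sketch of wiring a combiner into a
Hadoop 0.20 (new API) job. It assumes a hypothetical word-count-style job
where the reduce step is a commutative, associative sum, so the reducer class
can double as the combiner; all class names and paths here are invented.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerSketch {

  // Emits (word, 1) for every token in the input line.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String tok : value.toString().split("\\s+")) {
        if (tok.length() == 0) continue;
        word.set(tok);
        ctx.write(word, ONE);
      }
    }
  }

  // Sums the counts; because addition is commutative and associative,
  // the same class can safely run as both combiner and reducer.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "combiner-sketch");
    job.setJarByClass(CombinerSketch.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // pre-aggregate on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(4);                 // the reducer count is also tunable
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}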

This is getting pretty OT, I suggest revisiting the map-reduce paper
and the hadoop docs.

-ryan

On Thu, Aug 20, 2009 at 9:24 PM, john smith js1987.sm...@gmail.com wrote:
 Thanks for all your replies, guys. As Bharath said, what is the case when
 the number of reducers becomes more than the number of distinct map output keys?

 On Fri, Aug 21, 2009 at 9:39 AM, bharath vissapragada 
 bharathvissapragada1...@gmail.com wrote:

 Amandeep, Gray and Purtell, thanks for your replies .. I have found them
 very useful.

 You said to increase the number of reduce tasks. Suppose the number of
 reduce tasks is more than the number of distinct map output keys; do some of
 the reduce processes then go to waste? Is that the case?

 Also, I have one more doubt: I have 5 values for a corresponding key on one
 region and the other 2 values on 2 different region servers.
 Does Hadoop MapReduce take care of moving these 2 values to the region
 with the 5 values, instead of moving those 5 values to another system, to
 minimize the dataflow? Is that what is happening inside?

 On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org
 wrote:

  The behavior of TableInputFormat is to schedule one mapper for every
 table
  region.
 
  In addition to what others have said already, if your reducer is doing
  little more than storing data back into HBase (via TableOutputFormat),
 then
  you can consider writing results back to HBase directly from the mapper
 to
  avoid incurring the overhead of sort/shuffle/merge which happens within
 the
  Hadoop job framework as map outputs are input into reducers. For that
 type
  of use case -- using the Hadoop mapreduce subsystem as essentially a grid
  scheduler -- something like job.setNumReducers(0) will do the trick.
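
For illustration, a minimal, untested sketch of this map-only pattern against
the HBase 0.20 mapreduce API. The table names ("source", "target") and class
names are invented, and on the new-API Job object the actual call is
job.setNumReduceTasks(0).

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyCopy {

  // Reads each row of the source table and writes a Put for the target
  // table straight from the mapper -- no sort/shuffle/merge, no reducers.
  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result columns, Context ctx)
        throws IOException, InterruptedException {
      Put put = new Put(row.get());
      for (KeyValue kv : columns.raw()) {
        put.add(kv.getFamily(), kv.getQualifier(), kv.getValue());
      }
      ctx.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "map-only-copy");
    job.setJarByClass(MapOnlyCopy.class);

    Scan scan = new Scan();                       // one mapper per region of "source"
    TableMapReduceUtil.initTableMapperJob("source", scan, CopyMapper.class,
        ImmutableBytesWritable.class, Put.class, job);
    TableMapReduceUtil.initTableReducerJob("target", null, job); // sets TableOutputFormat
    job.setNumReduceTasks(0);                     // skip the reduce stage entirely

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}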
 
  Best regards,
 
    - Andy
 
 
 
 
  
  From: john smith js1987.sm...@gmail.com
  To: hbase-user@hadoop.apache.org
  Sent: Friday, August 21, 2009 12:42:36 AM
  Subject: Doubt in HBase
 
  Hi all,
 
  I have one small doubt. Kindly answer it even if it sounds silly.
 
  I am using MapReduce with HBase in distributed mode. I have a table which
  spans 5 region servers. I am using TableInputFormat to read the data
  from the table in the map. When I run the program, by default how many
  map tasks are created? Is it one per region server or more?
 
  Also, after the map tasks are over, the reduce task is taking a bit more
  time. Is it due to moving the map output across the regionservers, i.e.,
  moving the values of the same key to a particular reduce phase to start
  the reducer? Is there any way I can optimize the code (e.g. by storing
  data of the same reducer nearby)?
 
  Thanks :)
 
 
 
 




Re: Doubt in HBase

2009-08-21 Thread Ryan Rawson
hey,

Yes, the hadoop system attempts to assign map tasks so that they are
data-local, but why would you be worried about this for 5 values?  The max
value size in hbase is Integer.MAX_VALUE, so it's not like you have much
data to shuffle. Once your blobs are > ~64 MB or so, it might make more sense to
use HDFS directly and keep only the metadata in hbase (including
things like location of the data blob).
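
For illustration, a rough sketch of that blob-pointer pattern against the
0.20-era client APIs: write the large payload to HDFS and keep only its
location and size in an HBase row. The table name "blob_index", the column
family "meta", and the /blobs path layout are invented for the example.

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BlobPointerSketch {

  // Writes the blob to HDFS, then records its path and size in HBase.
  public static void storeBlob(String rowKey, byte[] blob) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();

    // 1. Put the large payload in HDFS, where big sequential files belong.
    FileSystem fs = FileSystem.get(conf);
    Path blobPath = new Path("/blobs/" + rowKey);   // invented layout
    FSDataOutputStream out = fs.create(blobPath);
    out.write(blob);
    out.close();

    // 2. Keep only the metadata (location, length, ...) in the HBase row.
    HTable table = new HTable(conf, "blob_index");  // invented table name
    Put put = new Put(Bytes.toBytes(rowKey));
    put.add(Bytes.toBytes("meta"), Bytes.toBytes("hdfs_path"),
        Bytes.toBytes(blobPath.toString()));
    put.add(Bytes.toBytes("meta"), Bytes.toBytes("length"),
        Bytes.toBytes((long) blob.length));
    table.put(put);
    table.flushCommits();
  }
}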

I think people are confused about how optimal map reduces have to be.
Keeping all the data super-local on each machine is not always helping
you, since you have to read via a socket anyways. Going remote doesn't
actually make things that much slower, since on a modern lan ping
times are < 0.1 ms.  If your entire cluster is hanging off a single
switch, there is nearly unlimited bandwidth between all nodes
(certainly much higher than any single system could push).  Only once
you go multi-switch then switch-locality (aka rack locality) becomes
important.

Remember, hadoop isn't about the instantaneous speed of any job, but
about running jobs in a highly scalable manner that works on tens or
tens of thousands of nodes. You end up blocking on single machine
limits anyways, and the r=3 of HDFS helps you transcend a single
machine read speed for large files. Keeping the data transfer local in
this case results in lower performance.

If you want max local speed, I suggest looking at CUDA.


On Thu, Aug 20, 2009 at 9:09 PM, bharath
vissapragada bharathvissapragada1...@gmail.com wrote:
 Aamandeep , Gray and Purtell thanks for your replies .. I have found them
 very useful.

 You said to increase the number of reduce tasks . Suppose the number of
 reduce tasks is more than number of distinct map output keys , some of the
 reduce processes may go waste ? is that the case?

 Also  I have one more doubt ..I have 5 values for a corresponding key on one
 region  and other 2 values on 2 different region servers.
 Does hadoop Map reduce take care of moving these 2 diff values to the region
 with 5 values instead of moving those 5 values to other system to minimize
 the dataflow? Is this what is happening inside ?

 On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org wrote:

 The behavior of TableInputFormat is to schedule one mapper for every table
 region.

 In addition to what others have said already, if your reducer is doing
 little more than storing data back into HBase (via TableOutputFormat), then
 you can consider writing results back to HBase directly from the mapper to
 avoid incurring the overhead of sort/shuffle/merge which happens within the
 Hadoop job framework as map outputs are input into reducers. For that type
 of use case -- using the Hadoop mapreduce subsystem as essentially a grid
 scheduler -- something like job.setNumReducers(0) will do the trick.

 Best regards,

   - Andy




 
 From: john smith js1987.sm...@gmail.com
 To: hbase-user@hadoop.apache.org
 Sent: Friday, August 21, 2009 12:42:36 AM
 Subject: Doubt in HBase

 Hi all ,

 I have one small doubt . Kindly answer it even if it sounds silly.

 Iam using Map Reduce in HBase in distributed mode .  I have a table which
 spans across 5 region servers . I am using TableInputFormat to read the
 data
 from the tables in the map . When i run the program , by default how many
 map regions are created ? Is it one per region server or more ?

 Also after the map task is over.. reduce task is taking a bit more time .
 Is
 it due to moving the map output across the regionservers? i.e, moving the
 values of same key to a particular reduce phase to start the reducer? Is
 there any way i can optimize the code (e.g. by storing data of same reducer
 nearby )

 Thanks :)







Re: Location of HBase's database (database' s files) on the hard disk

2009-08-21 Thread Nguyen Thi Ngoc Huong
How can I configure the location of the hbase directory? I configured
hbase-site.xml as follows:

<property>
  <name>hbase.rootdir</name>
  <value>file:///temp/hbase-${user.name}/hbase</value>
  <description>The directory shared by region servers.
  Should be fully-qualified to include the filesystem to use.
  E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
  </description>
</property>

and the log file is
Not starting HMaster because:
java.io.IOException: Mkdirs failed to create file:/temp/hbase-huongntn/hbase
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
at org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
at
org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
at
org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)
2009-08-21 13:35:24,163 ERROR org.apache.hadoop.hbase.master.HMaster: Can
not start master
java.io.IOException: Mkdirs failed to create file:/temp/hbase-huongntn/hbase
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
at org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
at
org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
at
org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)


2009/8/21 Amandeep Khurana ama...@gmail.com



 You configure the location of the hbase directory in the hbase-site.xml

 The data being lost could have multiple reasons. To rule out the
 basic one - where have you pointed the hdfs to store data? If it's
 going into /tmp, you'll lose data every time the tmp cleaner comes into
 action.

 On 8/20/09, Nguyen Thi Ngoc Huong huongn...@gmail.com wrote:
  Hi all,
  I am a beginner to HBase. I have some question with Hbase after setup
 Hbase
  and Hadoop.
 
  First, after setting up HBase and creating a new database, I don't know
  where HBase's database (the database's files) is located on the hard disk.
  At first, I thought it was in the hbase.rootdir directory; however, when I
  delete the hbase.rootdir directory and type the command "list", all of the
  databases still exist.
 
  Second, after restarting the computer and restarting HBase, all of HBase's
  databases are lost. Is this always the case, or did I configure something
  wrong? How can I configure HBase to keep the database after a computer
  restart?
 
  --
  Nguyễn Thị Ngọc Hương
 


 --


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz




-- 
Nguyễn Thị Ngọc Hương


Re: Doubt in HBase

2009-08-21 Thread bharath vissapragada
Thanks Ryan

I was just explaining with an example .. I have TBs of data to work with.
I just wanted to know whether the scheduler TRIES to assign the reduce phase
so as to keep the data local (i.e., TRYING to assign it to the machine with
the greater number of key values).

Thanks for your reply (following you on Twitter :))

On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com wrote:

 hey,

 Yes the hadoop system attempts to assign map tasks to data local, but
 why would you be worried about this for 5 values?  The max value size
 in hbase is Integer.MAX_VALUE, so it's not like you have much data to
 shuffle. Once your blobs are > ~64 MB or so, it might make more sense to
 use HDFS directly and keep only the metadata in hbase (including
 things like location of the data blob).

 I think people are confused about how optimal map reduces have to be.
 Keeping all the data super-local on each machine is not always helping
 you, since you have to read via a socket anyways. Going remote doesn't
 actually make things that much slower, since on a modern lan ping
 times are < 0.1 ms.  If your entire cluster is hanging off a single
 switch, there is nearly unlimited bandwidth between all nodes
 (certainly much higher than any single system could push).  Only once
 you go multi-switch then switch-locality (aka rack locality) becomes
 important.

 Remember, hadoop isn't about the instantaneous speed of any job, but
 about running jobs in a highly scalable manner that works on tens or
 tens of thousands of nodes. You end up blocking on single machine
 limits anyways, and the r=3 of HDFS helps you transcend a single
 machine read speed for large files. Keeping the data transfer local in
 this case results in lower performance.

 If you want max local speed, I suggest looking at CUDA.


 On Thu, Aug 20, 2009 at 9:09 PM, bharath
 vissapragadabharathvissapragada1...@gmail.com wrote:
  Aamandeep , Gray and Purtell thanks for your replies .. I have found them
  very useful.
 
  You said to increase the number of reduce tasks . Suppose the number of
  reduce tasks is more than number of distinct map output keys , some of
 the
  reduce processes may go waste ? is that the case?
 
  Also  I have one more doubt ..I have 5 values for a corresponding key on
 one
  region  and other 2 values on 2 different region servers.
  Does hadoop Map reduce take care of moving these 2 diff values to the
 region
  with 5 values instead of moving those 5 values to other system to
 minimize
  the dataflow? Is this what is happening inside ?
 
  On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org
 wrote:
 
  The behavior of TableInputFormat is to schedule one mapper for every
 table
  region.
 
  In addition to what others have said already, if your reducer is doing
  little more than storing data back into HBase (via TableOutputFormat),
 then
  you can consider writing results back to HBase directly from the mapper
 to
  avoid incurring the overhead of sort/shuffle/merge which happens within
 the
  Hadoop job framework as map outputs are input into reducers. For that
 type
  of use case -- using the Hadoop mapreduce subsystem as essentially a
 grid
  scheduler -- something like job.setNumReducers(0) will do the trick.
 
  Best regards,
 
- Andy
 
 
 
 
  
  From: john smith js1987.sm...@gmail.com
  To: hbase-user@hadoop.apache.org
  Sent: Friday, August 21, 2009 12:42:36 AM
  Subject: Doubt in HBase
 
  Hi all ,
 
  I have one small doubt . Kindly answer it even if it sounds silly.
 
  Iam using Map Reduce in HBase in distributed mode .  I have a table
 which
  spans across 5 region servers . I am using TableInputFormat to read the
  data
  from the tables in the map . When i run the program , by default how
 many
  map regions are created ? Is it one per region server or more ?
 
  Also after the map task is over.. reduce task is taking a bit more time
 .
  Is
  it due to moving the map output across the regionservers? i.e, moving
 the
  values of same key to a particular reduce phase to start the reducer? Is
  there any way i can optimize the code (e.g. by storing data of same
 reducer
  nearby )
 
  Thanks :)
 
 
 
 
 



Re: Location of HBase's database (database' s files) on the hard disk

2009-08-21 Thread Amandeep Khurana
On Thu, Aug 20, 2009 at 11:46 PM, Nguyen Thi Ngoc Huong huongn...@gmail.com
 wrote:

 How can I configure the location of the hbase directory? I configured
 hbase-site.xml as follow:

 <property>
   <name>hbase.rootdir</name>
   <value>file:///temp/hbase-${user.name}/hbase</value>
   <description>The directory shared by region servers.
   Should be fully-qualified to include the filesystem to use.
   E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
   </description>
 </property>


That's the trouble: your data is being stored under /temp on the local
filesystem. Store it in your HDFS instead, so the value of the above
property would be something like
hdfs://namenodeserver:port/hbase




 and the log file is
 Not starting HMaster because:
 java.io.IOException: Mkdirs failed to create
 file:/temp/hbase-huongntn/hbase
 at
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
at org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
at
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
at
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)
 2009-08-21 13:35:24,163 ERROR org.apache.hadoop.hbase.master.HMaster: Can
 not start master
 java.io.IOException: Mkdirs failed to create
 file:/temp/hbase-huongntn/hbase
at
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
at org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
at
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
at
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)


 2009/8/21 Amandeep Khurana ama...@gmail.com



  You configure the location of the hbase directory in the hbase-site.xml
 
  The data being lost could have multilple reasons. To rule out the
  basic one - where have you pointed the hdfs to store data? If its
  going into /tmp, you'll lose data everytime the tmp cleaner comes into
  action.
 
  On 8/20/09, Nguyen Thi Ngoc Huong huongn...@gmail.com wrote:
   Hi all,
   I am a beginner to HBase. I have some question with Hbase after setup
  Hbase
   and Hadoop.
  
   The first, After setup Hbase and create a new database, I don't know
  where
   is location of HBase's database (database' s files) on the hard disk.
 At
  the
   first, I think it is on the hbase.rootdir directory, however, when I
  delete
   directory hbase.rootdir, and type the command list, all of database
   exist.
  
   The second, after restart computer and restart hbase, all database of
  HBase
   is lost. Is it always true? Or did I configure wrong? How can i
 configure
   Hbase to save  database after restart computer?
  
   --
   Nguyễn Thị Ngọc Hương
  
 
 
  --
 
 
  Amandeep Khurana
  Computer Science Graduate Student
  University of California, Santa Cruz
 



 --
 Nguyễn Thị Ngọc Hương



Re: HBase-0.20.0 multi read

2009-08-21 Thread y_823910

I have a 3-PC cluster (pc1, pc2, pc3).
Hadoop master (pc1), 2 slaves (pc2, pc3)

HBase and ZK running on pc1, two region servers (pc2,pc3)

pc1 : Intel core2 , 2.4GHz , RAM 1G

pc2 : Intel core2 , 2.4GHz , RAM 1G

pc3 : Intel core2 , 1.86GHZ, RAM 2G

---

hbase-env.sh
 export HBASE_MANAGES_ZK=true

---
<configuration>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>true: fully-distributed with unmanaged ZooKeeper Quorum
    </description>
  </property>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://convera:9000/hbase</value>
    <description>The directory shared by region servers.
    Should be fully-qualified to include the filesystem to use.
    E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
    </description>
  </property>

  <property>
    <name>hbase.master</name>
    <value>10.42.253.182:6</value>
    <description>The host and port that the HBase master runs at.
    A value of 'local' runs the master and a regionserver in
    a single process.
    </description>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>convera</value>
    <description>Comma separated list of servers in the ZooKeeper Quorum.
    For example, host1.mydomain.com,host2.mydomain.com,host3.mydomain.com.
    By default this is set to localhost for local and pseudo-distributed
    modes of operation. For a fully-distributed setup, this should be set
    to a full list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set
    in hbase-env.sh this is the list of servers which we will start/stop
    ZooKeeper on.
    </description>
  </property>

  <property>
    <name>hbase.zookeeper.property.maxClientCnxns</name>
    <value>30</value>
    <description>Property from ZooKeeper's config zoo.cfg.
    Limit on number of concurrent connections (at the socket level) that a
    single client, identified by IP address, may make to a single member of
    the ZooKeeper ensemble. Set high to avoid zk connection issues running
    standalone and pseudo-distributed.
    </description>
  </property>

</configuration>







  
From: Amandeep Khurana ama...@gmail.com
To: hbase-user@hadoop.apache.org
cc: (bcc: Y_823910/TSMC)
Subject: Re: HBase-0.20.0 multi read
Date: 2009/08/21 11:54 AM
Please respond to hbase-user
  

  

  




You ideally want to have 3-5 servers outside the hbase servers... 1
server is not enough. That could be causing you the trouble.

Post logs from the master and the region server where the read failed.

Also, what's your configuration? How many nodes, ram, cpus etc?

On 8/20/09, y_823...@tsmc.com y_823...@tsmc.com wrote:

 Hi there,

 It worked well while I fired 5 threads to fetch data from HBase, but
 it failed after I increased to 6 threads.
 Although it only showed some WARNs, the threads' jobs can't be done!
 My hbase is the latest version, hbase 0.20.
 I want to test HBase multi-read performance.
 Any suggestion?
 Thank you

 Fleming


 hbase-env.sh
export HBASE_MANAGES_ZK=true

 09/08/21 09:54:07 WARN zookeeper.ZooKeeperWrapper: Failed to create
/hbase
 -- check quorum servers, currently=10.42.253.182:2181
 org.apache.zookeeper.KeeperException$ConnectionLossException:
 KeeperErrorCode = ConnectionLoss for /hbase
   at
 org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
   at
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:522)
   at


Re: HBase-0.20.0 multi read

2009-08-21 Thread Amandeep Khurana
On Fri, Aug 21, 2009 at 12:45 AM, y_823...@tsmc.com wrote:


 I have 3 PC cluster.(pc1 , pc2 , pc3)
 Hadoop master (pc1), 2 slaves (pc2,pc3)

 HBase and ZK running on pc1, two region servers (pc2,pc3)

 pc1 : Intel core2 , 2.4GHz , RAM 1G

 pc2 : Intel core2 , 2.4GHz , RAM 1G

 pc3 : Intel core2 , 1.86GHZ, RAM 2G


This is a very low config for HBase. I doubt you'll be able to get a
remotely stable hbase instance going on this, even more so if you are trying
to test how much load it can take...



 ---

 hbase-env.sh
  export HBASE_MANAGES_ZK=true

 ---
 configuration

  property
namehbase.cluster.distributed/name
valuetrue/value
descriptiontrue:fully-distributed with unmanaged Zookeeper Quorum
/description
  /property

   property
namehbase.rootdir/name
valuehdfs://convera:9000/hbase/value
descriptionThe directory shared by region servers.
Should be fully-qualified to include the filesystem to use.
E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
/description
  /property

  property
namehbase.master/name
value10.42.253.182:6/value
descriptionThe host and port that the HBase master runs at.
A value of 'local' runs the master and a regionserver in
a single process.
/description
  /property
  property
namehbase.zookeeper.quorum/name
valueconvera/value
 descriptionComma separated list of servers in the ZooKeeper Quorum.
For example,
 host1.mydomain.com,host2.mydomain.com,host3.mydomain.com.
By default this is set to localhost for local and pseudo-distributed
 modes
of operation. For a fully-distributed setup, this should be set to a
 full
list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in
 hbase-env.sh
this is the list of servers which we will start/stop ZooKeeper on.
/description
  /property

 property
 namehbase.zookeeper.property.maxClientCnxns/name
value30/value
descriptionProperty from ZooKeeper's config zoo.cfg.
Limit on number of concurrent connections (at the socket level) that a
single client, identified by IP address, may make to a single member of
the ZooKeeper ensemble. Set high to avoid zk connection issues running
standalone and pseudo-distributed.
/description
  /property

 /configuration







 From: Amandeep Khurana ama...@gmail.com
 To: hbase-user@hadoop.apache.org
 cc: (bcc: Y_823910/TSMC)
 Subject: Re: HBase-0.20.0 multi read
 Date: 2009/08/21 11:54 AM
 Please respond to hbase-user






 You ideally want to have 3-5 servers outside the hbase servers... 1
 server is not enough. That could to be causing you the trouble.

 Post logs from the master and the region server where the read failed.

 Also, what's your configuration? How many nodes, ram, cpus etc?

 On 8/20/09, y_823...@tsmc.com y_823...@tsmc.com wrote:
 
  Hi there,
 
  It worked well while I fired 5 threads to fetch data from HBASE,but
  it failed after I incresed to 6 threads.
  Although it showed some WARN, the thread job can't be done!
  My hbase is the latest version hbase0.20.
  I want to test HBase multi read performance.
  Any suggestion?
  Thank you
 
  Fleming
 
 
  hbase-env.sh
 export HBASE_MANAGES_ZK=true
 
  09/08/21 09:54:07 WARN zookeeper.ZooKeeperWrapper: Failed to create
 /hbase
  -- check quorum servers, currently=10.42.253.182:2181
  org.apache.zookeeper.KeeperException$ConnectionLossException:
  KeeperErrorCode = ConnectionLoss for /hbase
at
  org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at
  org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:522)
at
 

 org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:342)

at
 

 org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:365)

at
 

 org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:478)

at
 

 org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:846)

at
 

 org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:515)

at
 

 org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:491)

at
 

 org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:565)

at
 

 org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:524)

at
 

 

Re: Location of HBase's database (database' s files) on the hard disk

2009-08-21 Thread Nguyen Thi Ngoc Huong
Thank you very much. I edited the file hbase-site.xml as follows:

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:54310/hbase</value>
  <description>The directory shared by region servers.
  Should be fully-qualified to include the filesystem to use.
  E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
  </description>
</property>

with fs.default.name set to hdfs://localhost:54310.
Now I can see the hbase database in the Hadoop site manager (in the hbase
directory, not the tmp directory, in hdfs).
However, when I restart my computer, I must restart hadoop (by running
./bin/hadoop namenode -format and ./bin/start-all.sh), restart hbase, and my
database is lost. What can I do to save my database?

2009/8/21 Amandeep Khurana ama...@gmail.com

 On Thu, Aug 20, 2009 at 11:46 PM, Nguyen Thi Ngoc Huong 
 huongn...@gmail.com
  wrote:

  How can I configure the location of the hbase directory? I configured
  hbase-site.xml as follow:
 
  property
 namehbase.rootdir/name
 value*file:///temp/hbase-${user.name}/hbase*/value
 descriptionThe directory shared by region servers.
 Should be fully-qualified to include the filesystem to use.
 E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
 /description
   /property
 

 Thats the trouble.. Your data is being stored in the temp.. instead store
 it
 in your hdfs.
 so the value of the above property would be something like
 *hdfs://namenodeserver:port/hbase*



 
  and the log file is
  Not starting HMaster because:
  java.io.IOException: Mkdirs failed to create
  file:/temp/hbase-huongntn/hbase
  at
 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
 at org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
 at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
 at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
 at
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
 at
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
 at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
 at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)
  2009-08-21 13:35:24,163 ERROR org.apache.hadoop.hbase.master.HMaster: Can
  not start master
  java.io.IOException: Mkdirs failed to create
  file:/temp/hbase-huongntn/hbase
 at
 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
 at org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
 at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
 at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
 at
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
 at
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
 at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
 at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)
 
 
  2009/8/21 Amandeep Khurana ama...@gmail.com
 
 
 
   You configure the location of the hbase directory in the hbase-site.xml
  
   The data being lost could have multilple reasons. To rule out the
   basic one - where have you pointed the hdfs to store data? If its
   going into /tmp, you'll lose data everytime the tmp cleaner comes into
   action.
  
   On 8/20/09, Nguyen Thi Ngoc Huong huongn...@gmail.com wrote:
Hi all,
I am a beginner to HBase. I have some question with Hbase after setup
   Hbase
and Hadoop.
   
The first, After setup Hbase and create a new database, I don't know
   where
is location of HBase's database (database' s files) on the hard disk.
  At
   the
first, I think it is on the hbase.rootdir directory, however, when I
   delete
directory hbase.rootdir, and type the command list, all of database
exist.
   
The second, after restart computer and restart hbase, all database of
   HBase
is lost. Is it always true? Or did I configure wrong? How can i
  configure
Hbase to save  database after restart computer?
   
--
Nguyễn Thị Ngọc Hương
   
  
  
   --
  
  
   Amandeep Khurana
   Computer Science Graduate Student
   University of California, Santa Cruz
  
 
 
 
  --
  Nguyễn Thị Ngọc Hương
 




-- 
Nguyễn Thị Ngọc Hương


Re: Location of HBase's database (database' s files) on the hard disk

2009-08-21 Thread Amandeep Khurana
On Fri, Aug 21, 2009 at 1:03 AM, Nguyen Thi Ngoc Huong
huongn...@gmail.com wrote:

 Thanks you very much. I editted file hbase-site.xml as follow

 property
namehbase.rootdir/name
 valuehdfs://localhost:54310/hbase/value
 descriptionThe directory shared by region servers.
Should be fully-qualified to include the filesystem to use.
E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
/description
  /property

 with fs.default.name is hdfs://localhost:54310
 Now, I can see hbase database in Hadoop site manager (in hbase directory
 not tmp directory in hdfs ).
 However, when I restart my computer, I must restart hadoop (by command
 ./bin/hadoop format namenode and ./bin/start all) , restart hbase, and my
 database is lost. What can I do to save my database?


You don't need to format the namenode every time.. Just bin/start-all.sh




 2009/8/21 Amandeep Khurana ama...@gmail.com

  On Thu, Aug 20, 2009 at 11:46 PM, Nguyen Thi Ngoc Huong 
  huongn...@gmail.com
   wrote:
 
   How can I configure the location of the hbase directory? I configured
   hbase-site.xml as follow:
  
   property
  namehbase.rootdir/name
  value*file:///temp/hbase-${user.name}/hbase*/value
  descriptionThe directory shared by region servers.
  Should be fully-qualified to include the filesystem to use.
  E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
  /description
/property
  
 
  Thats the trouble.. Your data is being stored in the temp.. instead store
  it
  in your hdfs.
  so the value of the above property would be something like
  *hdfs://namenodeserver:port/hbase*
 
 
 
  
   and the log file is
   Not starting HMaster because:
   java.io.IOException: Mkdirs failed to create
   file:/temp/hbase-huongntn/hbase
   at
  
 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
  at org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
  at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
  at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
  at
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
  at
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
  at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
  at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)
   2009-08-21 13:35:24,163 ERROR org.apache.hadoop.hbase.master.HMaster:
 Can
   not start master
   java.io.IOException: Mkdirs failed to create
   file:/temp/hbase-huongntn/hbase
  at
  
 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
  at org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
  at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
  at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
  at
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
  at
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
  at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
  at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)
  
  
   2009/8/21 Amandeep Khurana ama...@gmail.com
  
  
  
You configure the location of the hbase directory in the
 hbase-site.xml
   
The data being lost could have multilple reasons. To rule out the
basic one - where have you pointed the hdfs to store data? If its
going into /tmp, you'll lose data everytime the tmp cleaner comes
 into
action.
   
On 8/20/09, Nguyen Thi Ngoc Huong huongn...@gmail.com wrote:
 Hi all,
 I am a beginner to HBase. I have some question with Hbase after
 setup
Hbase
 and Hadoop.

 The first, After setup Hbase and create a new database, I don't
 know
where
 is location of HBase's database (database' s files) on the hard
 disk.
   At
the
 first, I think it is on the hbase.rootdir directory, however, when
 I
delete
 directory hbase.rootdir, and type the command list, all of
 database
 exist.

 The second, after restart computer and restart hbase, all database
 of
HBase
 is lost. Is it always true? Or did I configure wrong? How can i
   configure
 Hbase to save  database after restart computer?

 --
 Nguyễn Thị Ngọc Hương

   
   
--
   
   
Amandeep Khurana
Computer Science 

Re: HBase-0.20.0 multi read

2009-08-21 Thread y_823910
You mean my PCs are not good enough to run HBase well?
I've put 5 Oracle tables into HBase successfully; the biggest table's record
count is only 50,000.
Is there a client request limit per region server?
Two region servers serving just 5 clients is a little strange!
Any suggested hardware spec for HBase?
For that spec, how many clients can fetch data from HBase concurrently?

Fleming





  
From: Amandeep Khurana ama...@gmail.com
To: hbase-user@hadoop.apache.org
cc: (bcc: Y_823910/TSMC)
Subject: Re: HBase-0.20.0 multi read
Date: 2009/08/21 03:49 PM
Please respond to hbase-user
  

  

  




On Fri, Aug 21, 2009 at 12:45 AM, y_823...@tsmc.com wrote:


 I have 3 PC cluster.(pc1 , pc2 , pc3)
 Hadoop master (pc1), 2 slaves (pc2,pc3)

 HBase and ZK running on pc1, two region servers (pc2,pc3)

 pc1 : Intel core2 , 2.4GHz , RAM 1G

 pc2 : Intel core2 , 2.4GHz , RAM 1G

 pc3 : Intel core2 , 1.86GHZ, RAM 2G


This is a very low config for HBase. I doubt if you'll be able to get a
remotely stable hbase instance going in this. More so, if you are trying to
test how much load it can take...



 ---

 hbase-env.sh
  export HBASE_MANAGES_ZK=true

 ---
 configuration

  property
namehbase.cluster.distributed/name
valuetrue/value
descriptiontrue:fully-distributed with unmanaged Zookeeper Quorum
/description
  /property

   property
namehbase.rootdir/name
valuehdfs://convera:9000/hbase/value
descriptionThe directory shared by region servers.
Should be fully-qualified to include the filesystem to use.
E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
/description
  /property

  property
namehbase.master/name
value10.42.253.182:6/value
descriptionThe host and port that the HBase master runs at.
A value of 'local' runs the master and a regionserver in
a single process.
/description
  /property
  property
namehbase.zookeeper.quorum/name
valueconvera/value
 descriptionComma separated list of servers in the ZooKeeper Quorum.
For example,
 host1.mydomain.com,host2.mydomain.com,host3.mydomain.com.
By default this is set to localhost for local and pseudo-distributed
 modes
of operation. For a fully-distributed setup, this should be set to a
 full
list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in
 hbase-env.sh
this is the list of servers which we will start/stop ZooKeeper on.
/description
  /property

 property
 namehbase.zookeeper.property.maxClientCnxns/name
value30/value
descriptionProperty from ZooKeeper's config zoo.cfg.
Limit on number of concurrent connections (at the socket level) that a
single client, identified by IP address, may make to a single member
of
the ZooKeeper ensemble. Set high to avoid zk connection issues running
standalone and pseudo-distributed.
/description
  /property

 /configuration







 From: Amandeep Khurana ama...@gmail.com
 To: hbase-user@hadoop.apache.org
 cc: (bcc: Y_823910/TSMC)
 Subject: Re: HBase-0.20.0 multi read
 Date: 2009/08/21 11:54 AM
 Please respond to hbase-user






 You ideally want to have 3-5 servers outside the hbase servers... 1
 server is not enough. That could 

Re: HBase-0.20.0 multi read

2009-08-21 Thread Amandeep Khurana
On Fri, Aug 21, 2009 at 1:12 AM, y_823...@tsmc.com wrote:

 You mean my PCs are not good enough to run HBase well ?


That's right.. HBase is a RAM hogger. The nodes in my cluster have 8GB RAM
each and even that is low... I run into trouble because of that.



 I've put 5 oracle tables to HBase successfully , the biggest table record
 count is only 50,000.


That's a small data set. Not much.



 Is there a client request limit for region server?


Good question. I don't have an answer straight away. However, I think it's
got to be related to the RPC handlers. I'd wait for someone else to answer
this more correctly.



 Two region server just serve 5 clients, it's a little strange!
 Any suggestion hardware spec for HBase?
 For that spec, how many clients can fetch data from HBase  concurrently?


Depends on your use case. What are you trying to accomplish with HBase? In
any case, you would need about 8-9 nodes to have a stable setup.



 Fleming





 From: Amandeep Khurana ama...@gmail.com
 To: hbase-user@hadoop.apache.org
 cc: (bcc: Y_823910/TSMC)
 Subject: Re: HBase-0.20.0 multi read
 Date: 2009/08/21 03:49 PM
 Please respond to hbase-user






 On Fri, Aug 21, 2009 at 12:45 AM, y_823...@tsmc.com wrote:

 
  I have 3 PC cluster.(pc1 , pc2 , pc3)
  Hadoop master (pc1), 2 slaves (pc2,pc3)
 
  HBase and ZK running on pc1, two region servers (pc2,pc3)
 
  pc1 : Intel core2 , 2.4GHz , RAM 1G
 
  pc2 : Intel core2 , 2.4GHz , RAM 1G
 
  pc3 : Intel core2 , 1.86GHZ, RAM 2G
 

 This is a very low config for HBase. I doubt if you'll be able to get a
 remotely stable hbase instance going in this. More so, if you are trying to
 test how much load it can take...


 
  ---
 
  hbase-env.sh
   export HBASE_MANAGES_ZK=true
 
  ---
  configuration
 
   property
 namehbase.cluster.distributed/name
 valuetrue/value
 descriptiontrue:fully-distributed with unmanaged Zookeeper Quorum
 /description
   /property
 
property
 namehbase.rootdir/name
 valuehdfs://convera:9000/hbase/value
 descriptionThe directory shared by region servers.
 Should be fully-qualified to include the filesystem to use.
 E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
 /description
   /property
 
   property
 namehbase.master/name
 value10.42.253.182:6/value
 descriptionThe host and port that the HBase master runs at.
 A value of 'local' runs the master and a regionserver in
 a single process.
 /description
   /property
   property
 namehbase.zookeeper.quorum/name
 valueconvera/value
  descriptionComma separated list of servers in the ZooKeeper Quorum.
 For example,
  host1.mydomain.com,host2.mydomain.com,host3.mydomain.com.
 By default this is set to localhost for local and pseudo-distributed
  modes
 of operation. For a fully-distributed setup, this should be set to a
  full
 list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in
  hbase-env.sh
 this is the list of servers which we will start/stop ZooKeeper on.
 /description
   /property
 
  property
  namehbase.zookeeper.property.maxClientCnxns/name
 value30/value
 descriptionProperty from ZooKeeper's config zoo.cfg.
 Limit on number of concurrent connections (at the socket level) that a
 single client, identified by IP address, may make to a single member
 of
 the ZooKeeper ensemble. Set high to avoid zk connection issues running
 standalone and pseudo-distributed.
 /description
   /property
 
  /configuration
 
 
 
 
 
 
 
  From: Amandeep Khurana ama...@gmail.com
  To: hbase-user@hadoop.apache.org
  cc: (bcc: Y_823910/TSMC)
  Subject: Re: HBase-0.20.0 multi read
  Date: 2009/08/21 11:54 AM
  Please respond to hbase-user
 
 
 
 
 
 
  You ideally want to have 3-5 servers outside the hbase servers... 1
  server is not enough. That could to be causing you the trouble.
 
  Post logs from the master and the region server where the read failed.
 
  Also, what's your configuration? How many nodes, ram, cpus etc?
 
  On 8/20/09, y_823...@tsmc.com y_823...@tsmc.com wrote:
  
   Hi there,
  
   It worked well while I fired 5 threads to fetch data from HBASE,but
   it failed after I incresed to 6 threads.
   Although it showed some WARN, the thread job can't be done!
   My hbase is the latest version hbase0.20.
   I want to test HBase multi read performance.
   Any suggestion?
   Thank you
  
   Fleming
  
  
   

Re: Location of HBase's database (database' s files) on the hard disk

2009-08-21 Thread Nguyen Thi Ngoc Huong
You dont need to format the namenode everytime.. Just bin/start-all.sh

Really? With just bin/start-all.sh, the namenode is not started (when I type
the command jps, there are only 5 processes:
3421 SecondaryNameNode
3492 JobTracker
3582 TaskTracker
4031 Jps
3325 DataNode - there isn't a NameNode process),
and certainly the page http://localhost:50070 is dead and the connection from
HBase to hadoop is dead, too.


2009/8/21 Amandeep Khurana ama...@gmail.com

 On Fri, Aug 21, 2009 at 1:03 AM, Nguyen Thi Ngoc Huong
 huongn...@gmail.comwrote:

  Thanks you very much. I editted file hbase-site.xml as follow
 
  property
 namehbase.rootdir/name
  valuehdfs://localhost:54310/hbase/value
  descriptionThe directory shared by region servers.
 Should be fully-qualified to include the filesystem to use.
 E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
 /description
   /property
 
  with fs.default.name is hdfs://localhost:54310
  Now, I can see hbase database in Hadoop site manager (in hbase
 directory
  not tmp directory in hdfs ).
  However, when I restart my computer, I must restart hadoop (by command
  ./bin/hadoop format namenode and ./bin/start all) , restart hbase, and my
  database is lost. What can I do to save my database?
 

 You dont need to format the namenode everytime.. Just bin/start-all.sh



 
  2009/8/21 Amandeep Khurana ama...@gmail.com
 
   On Thu, Aug 20, 2009 at 11:46 PM, Nguyen Thi Ngoc Huong 
   huongn...@gmail.com
wrote:
  
How can I configure the location of the hbase directory? I configured
hbase-site.xml as follow:
   
property
   namehbase.rootdir/name
   value*file:///temp/hbase-${user.name}/hbase*/value
   descriptionThe directory shared by region servers.
   Should be fully-qualified to include the filesystem to use.
   E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
   /description
 /property
   
  
   Thats the trouble.. Your data is being stored in the temp.. instead
 store
   it
   in your hdfs.
   so the value of the above property would be something like
   *hdfs://namenodeserver:port/hbase*
  
  
  
   
and the log file is
Not starting HMaster because:
java.io.IOException: Mkdirs failed to create
file:/temp/hbase-huongntn/hbase
at
   
  
 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
   at
 org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
   at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
   at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
   at
   
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
   at
   
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
   at
 org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
   at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)
2009-08-21 13:35:24,163 ERROR org.apache.hadoop.hbase.master.HMaster:
  Can
not start master
java.io.IOException: Mkdirs failed to create
file:/temp/hbase-huongntn/hbase
   at
   
  
 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
   at
 org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
   at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
   at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
   at
   
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
   at
   
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
   at
 org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
   at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)
   
   
2009/8/21 Amandeep Khurana ama...@gmail.com
   
   
   
 You configure the location of the hbase directory in the
  hbase-site.xml

 The data being lost could have multilple reasons. To rule out the
 basic one - where have you pointed the hdfs to store data? If its
 going into /tmp, you'll lose data everytime the tmp cleaner comes
  into
 action.

 On 8/20/09, Nguyen Thi Ngoc Huong huongn...@gmail.com wrote:
  Hi all,
  I am a beginner to HBase. I have some question with Hbase after
  setup
 Hbase
  and Hadoop.
 
  The first, After setup Hbase and create a new database, I don't
  know
 where
  is location of 

Re: Location of HBase's database (database' s files) on the hard disk

2009-08-21 Thread Amandeep Khurana
1. If you have formatted your namenode before starting the first time, that's
all that's needed.

2. To start from scratch, delete everything that's there in the directory
where you are pointing your hdfs to; format the namenode again; start all.

3. If it still doesn't work, look at the namenode logs to see what's
happening. Post them here if you can't figure it out.


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Aug 21, 2009 at 1:30 AM, Nguyen Thi Ngoc Huong
huongn...@gmail.com wrote:

 You dont need to format the namenode everytime.. Just bin/start-all.sh

 Really? Just bin/start-all.sh, namnode is not started (when I type command
 jps, there are only 5 processes
 3421 SecondaryNameNode
 3492 JobTracker
 3582 TaskTracker
 4031 Jps
 3325 DataNode, there isn't Namenode process)
 and certainly, the page http://localhost:50070 is died and connection from
 Hbase to hadoop is died, too


 2009/8/21 Amandeep Khurana ama...@gmail.com

  On Fri, Aug 21, 2009 at 1:03 AM, Nguyen Thi Ngoc Huong
  huongn...@gmail.comwrote:
 
   Thanks you very much. I editted file hbase-site.xml as follow
  
   property
  namehbase.rootdir/name
   valuehdfs://localhost:54310/hbase/value
   descriptionThe directory shared by region servers.
  Should be fully-qualified to include the filesystem to use.
  E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
  /description
/property
  
   with fs.default.name is hdfs://localhost:54310
   Now, I can see hbase database in Hadoop site manager (in hbase
  directory
   not tmp directory in hdfs ).
   However, when I restart my computer, I must restart hadoop (by command
   ./bin/hadoop format namenode and ./bin/start all) , restart hbase, and
 my
   database is lost. What can I do to save my database?
  
 
  You dont need to format the namenode everytime.. Just bin/start-all.sh
 
 
 
  
   2009/8/21 Amandeep Khurana ama...@gmail.com
  
On Thu, Aug 20, 2009 at 11:46 PM, Nguyen Thi Ngoc Huong 
huongn...@gmail.com
 wrote:
   
 How can I configure the location of the hbase directory? I
 configured
 hbase-site.xml as follow:

 property
namehbase.rootdir/name
value*file:///temp/hbase-${user.name}/hbase*/value
descriptionThe directory shared by region servers.
Should be fully-qualified to include the filesystem to use.
E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
/description
  /property

   
Thats the trouble.. Your data is being stored in the temp.. instead
  store
it
in your hdfs.
so the value of the above property would be something like
*hdfs://namenodeserver:port/hbase*
   
   
   

 and the log file is
 Not starting HMaster because:
 java.io.IOException: Mkdirs failed to create
 file:/temp/hbase-huongntn/hbase
 at

   
  
 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
at
  org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
at
 org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
at
 org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
at

   
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
at

   
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
at
  org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
at
 org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)
 2009-08-21 13:35:24,163 ERROR
 org.apache.hadoop.hbase.master.HMaster:
   Can
 not start master
 java.io.IOException: Mkdirs failed to create
 file:/temp/hbase-huongntn/hbase
at

   
  
 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
at
  org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
at
 org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:210)
at
 org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:156)
at

   
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:96)
at

   
  
 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:78)
at
  org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
at
 org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)


 

Re: Location of HBase's database (database' s files) on the hard disk

2009-08-21 Thread Nguyen Thi Ngoc Huong
Thank you very much.
I deleted everything and configured the hadoop.tmp.dir property in
hadoop-site.xml as follows:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/huongntn/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

After that, I formatted the namenode and ran start-all. When I restarted my
computer and typed the command start-all, hadoop worked smoothly. I start
hbase with ./bin/start-hbase.sh and ./hbase shell.

Now I can't see my database in the hbase shell (with the command "list"),
although I can see it in the Hadoop site manager.


2009/8/21 Amandeep Khurana ama...@gmail.com

 1. If you have formatted your namenode before starting the first time,
 thats
 all thats needed.

 2. To start from scratch, delete everything thats there in the directory
 where you are pointing your hdfs to; format namenode again; start all

 3. If it still doesnt work, look at the namenode logs to see whats
 happening. Post it here if you cant figure it out.


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Fri, Aug 21, 2009 at 1:30 AM, Nguyen Thi Ngoc Huong
 huongn...@gmail.comwrote:

  You dont need to format the namenode everytime.. Just bin/start-all.sh
 
  Really? Just bin/start-all.sh, namnode is not started (when I type
 command
  jps, there are only 5 processes
  3421 SecondaryNameNode
  3492 JobTracker
  3582 TaskTracker
  4031 Jps
  3325 DataNode, there isn't Namenode process)
  and certainly, the page http://localhost:50070 is died and connection
 from
  Hbase to hadoop is died, too
 
 
  2009/8/21 Amandeep Khurana ama...@gmail.com
 
   On Fri, Aug 21, 2009 at 1:03 AM, Nguyen Thi Ngoc Huong
   huongn...@gmail.comwrote:
  
Thanks you very much. I editted file hbase-site.xml as follow
   
property
   namehbase.rootdir/name
valuehdfs://localhost:54310/hbase/value
descriptionThe directory shared by region servers.
   Should be fully-qualified to include the filesystem to use.
   E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
   /description
 /property
   
with fs.default.name is hdfs://localhost:54310
Now, I can see hbase database in Hadoop site manager (in hbase
   directory
not tmp directory in hdfs ).
However, when I restart my computer, I must restart hadoop (by
 command
./bin/hadoop format namenode and ./bin/start all) , restart hbase,
 and
  my
database is lost. What can I do to save my database?
   
  
   You dont need to format the namenode everytime.. Just bin/start-all.sh
  
  
  
   
2009/8/21 Amandeep Khurana ama...@gmail.com
   
 On Thu, Aug 20, 2009 at 11:46 PM, Nguyen Thi Ngoc Huong 
 huongn...@gmail.com
  wrote:

  How can I configure the location of the hbase directory? I
  configured
  hbase-site.xml as follow:
 
  property
 namehbase.rootdir/name
 value*file:///temp/hbase-${user.name}/hbase*/value
 descriptionThe directory shared by region servers.
 Should be fully-qualified to include the filesystem to use.
 E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
 /description
   /property
 

 Thats the trouble.. Your data is being stored in the temp.. instead
   store
 it
 in your hdfs.
 so the value of the above property would be something like
 *hdfs://namenodeserver:port/hbase*



 
  and the log file is:
  Not starting HMaster because:
  java.io.IOException: Mkdirs failed to create file:/temp/hbase-huongntn/hbase
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
        at org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
        at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:210)
        at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:156)
        at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:96)
        at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:78)
        at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)
  2009-08-21 13:35:24,163 ERROR org.apache.hadoop.hbase.master.HMaster: Can not start master
  java.io.IOException: Mkdirs failed to create file:/temp/hbase-huongntn/hbase
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
        at 

Re: WARN zookeeper.ClientCnxn: Ignoring exception during shutdown output

2009-08-21 Thread Lucas Nazário dos Santos
Hi,

Notice that someone has already filed a bug
(http://issues.apache.org/jira/browse/HBASE-1645). Has anybody been able to
work around it?

Thanks,
Lucas



On Thu, Aug 20, 2009 at 10:14 AM, Lucas Nazário dos Santos 
nazario.lu...@gmail.com wrote:

 Hi,

 Since I've migrated to HBase 0.20.0 RC1, the following error keeps
 happening. I have to kill HBase and start it again to recover from the
 exception. Does anybody know a workaround?

 Lucas


 09/08/20 10:09:01 WARN zookeeper.ClientCnxn: Ignoring exception during
 shutdown output
 java.net.SocketException: Transport endpoint is not connected
          at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
          at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
          at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
          at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:956)
          at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:922)



Re: WARN zookeeper.ClientCnxn: Ignoring exception during shutdown output

2009-08-21 Thread Jean-Daniel Cryans
Well like Stack said in the Jira:

This is not a biggy. Just noting it. No ZK, no HB.

So that means that your ZooKeeper server gets killed for some reason,
and if there's no ZK, there's no HBase. What you should do is figure
out why your ZK server is gone.

J-D

On Fri, Aug 21, 2009 at 7:52 AM, Lucas Nazário dos
Santosnazario.lu...@gmail.com wrote:
 Hi,

 Notice that someone has already filed a bug
 (http://issues.apache.org/jira/browse/HBASE-1645). Has anybody been able to
 work around it?

 Thanks,
 Lucas



 On Thu, Aug 20, 2009 at 10:14 AM, Lucas Nazário dos Santos 
 nazario.lu...@gmail.com wrote:

 Hi,

 Since I've migrated to HBase 0.20.0 RC1, the following error keeps
 happening. I have to kill HBase and start it again to recover from the
 exception. Does anybody know a workaround?

 Lucas


 09/08/20 10:09:01 WARN zookeeper.ClientCnxn: Ignoring exception during
 shutdown output
 java.net.SocketException: Transport endpoint is not connected
         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
         at
 sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
         at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
         at
 org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:956)
         at
 org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:922)




Re: Location of HBase's database (database' s files) on the hard disk

2009-08-21 Thread Jean-Daniel Cryans
2 questions:

Which version of HBase are you using?

Are you stopping HBase and Hadoop when restarting your computer?

Thx,

J-D

On Fri, Aug 21, 2009 at 5:46 AM, Nguyen Thi Ngoc
Huonghuongn...@gmail.com wrote:
 Thank you very much.
 I deleted everything and configured the hadoop.tmp.dir property in
 hadoop-site.xml as follows:

 <property>
   <name>hadoop.tmp.dir</name>
   <value>/home/huongntn/hadoop-${user.name}</value>
   <description>A base for other temporary directories.</description>
 </property>

 After that, I formatted the namenode and ran start-all. When I restarted my
 computer and typed the start-all command, Hadoop worked smoothly. I started
 HBase with ./bin/start-hbase.sh and ./hbase shell.

 Now I can't see my database in the HBase shell (with the list command),
 although I can see it in the Hadoop site manager.


 2009/8/21 Amandeep Khurana ama...@gmail.com

 1. If you have formatted your namenode before starting the first time,
 that's all that's needed.

 2. To start from scratch, delete everything that's in the directory
 where you are pointing your HDFS to; format the namenode again; start all.

 3. If it still doesn't work, look at the namenode logs to see what's
 happening. Post it here if you can't figure it out.


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Fri, Aug 21, 2009 at 1:30 AM, Nguyen Thi Ngoc Huong
 huongn...@gmail.comwrote:

  You dont need to format the namenode everytime.. Just bin/start-all.sh
 
  Really? With just bin/start-all.sh, the namenode is not started (when I
  type the jps command there are only 5 processes:
  3421 SecondaryNameNode
  3492 JobTracker
  3582 TaskTracker
  4031 Jps
  3325 DataNode -- there is no NameNode process)
  and certainly the page http://localhost:50070 is dead, and the connection
  from HBase to Hadoop is dead, too.
 
 
  2009/8/21 Amandeep Khurana ama...@gmail.com
 
   On Fri, Aug 21, 2009 at 1:03 AM, Nguyen Thi Ngoc Huong
   huongn...@gmail.comwrote:
  
    Thank you very much. I edited the file hbase-site.xml as follows:

    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://localhost:54310/hbase</value>
      <description>The directory shared by region servers.
      Should be fully-qualified to include the filesystem to use.
      E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
      </description>
    </property>

    with fs.default.name set to hdfs://localhost:54310.
    Now I can see the hbase database in the Hadoop site manager (in the
    hbase directory, not the tmp directory in HDFS).
    However, when I restart my computer I must restart Hadoop (with
    ./bin/hadoop namenode -format and ./bin/start-all.sh), restart HBase,
    and my database is lost. What can I do to save my database?
   
  
   You dont need to format the namenode everytime.. Just bin/start-all.sh
  
  
  
   
2009/8/21 Amandeep Khurana ama...@gmail.com
   
 On Thu, Aug 20, 2009 at 11:46 PM, Nguyen Thi Ngoc Huong 
 huongn...@gmail.com
  wrote:

  How can I configure the location of the hbase directory? I configured
  hbase-site.xml as follows:

  <property>
    <name>hbase.rootdir</name>
    <value>file:///temp/hbase-${user.name}/hbase</value>
    <description>The directory shared by region servers.
    Should be fully-qualified to include the filesystem to use.
    E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
    </description>
  </property>
 

 That's the trouble.. your data is being stored in temp. Instead, store it
 in your HDFS,
 so the value of the above property would be something like
 *hdfs://namenodeserver:port/hbase*



 
  and the log file is:
  Not starting HMaster because:
  java.io.IOException: Mkdirs failed to create file:/temp/hbase-huongntn/hbase
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
        at org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
        at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:210)
        at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:156)
        at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:96)
        at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:78)
        at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1013)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1057)
  2009-08-21 13:35:24,163 ERROR org.apache.hadoop.hbase.master.HMaster: Can not start master
  java.io.IOException: Mkdirs failed to create file:/temp/hbase-huongntn/hbase

Re: HBase-0.20.0 multi read

2009-08-21 Thread Jean-Daniel Cryans
No there's something else here, your machines are weak indeed but it
should be alright.

So you say you are fetching data from HBase... how? How big is each
row? What kind of load do you see on your machines? Lots of IO wait?

The error you pasted in the first mail seems to imply that your client
wasn't able to connect to Zookeeper. Is there a connectivity issue? Do
you see load on the machine hosting the ZK server? Did you monitor
that? Did you try to put a ZK process on each server (just add the
other machines in hbase.zookeeper.quorum)?

Without that kind of info it is hard to tell what's exactly wrong.

J-D
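
(One thing worth ruling out on the client side: sharing a single HTable across
threads. HTable is not thread-safe, so a read test should give each thread its
own instance. A minimal sketch; the table name and row keys are placeholders:)

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiReadTest {
  public static void main(String[] args) throws Exception {
    final int threads = 6;                          // the thread count that failed here
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      final int id = t;
      pool.submit(new Runnable() {
        public void run() {
          try {
            // HTable is not thread-safe: give every thread its own instance.
            HTable table = new HTable(new HBaseConfiguration(), "mytable");
            for (int i = 0; i < 1000; i++) {
              Get get = new Get(Bytes.toBytes("row-" + id + "-" + i));
              Result result = table.get(get);
              // ... consume result ...
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}

Each distinct HBaseConfiguration instance ends up with its own connection (and
its own ZooKeeper session), which is what hbase.zookeeper.property.maxClientCnxns
limits per client IP, so the per-thread count matters when sizing that property.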

On Fri, Aug 21, 2009 at 4:12 AM, y_823...@tsmc.com wrote:
 You mean my PCs are not good enough to run HBase well?
 I've put 5 Oracle tables into HBase successfully; the biggest table's record
 count is only 50,000.
 Is there a client request limit per region server?
 Two region servers serving just 5 clients seems a little strange!
 Any suggested hardware spec for HBase?
 For that spec, how many clients can fetch data from HBase concurrently?

 Fleming





 From: Amandeep Khurana ama...@gmail.com
 To: hbase-u...@hadoop.apache.org
 cc: (bcc: Y_823910/TSMC)
 Subject: Re: HBase-0.20.0 multi read
 Date: 2009/08/21 03:49 PM
 Please respond to hbase-user






 On Fri, Aug 21, 2009 at 12:45 AM, y_823...@tsmc.com wrote:


 I have 3 PC cluster.(pc1 , pc2 , pc3)
 Hadoop master (pc1), 2 slaves (pc2,pc3)

 HBase and ZK running on pc1, two region servers (pc2,pc3)

 pc1 : Intel core2 , 2.4GHz , RAM 1G

 pc2 : Intel core2 , 2.4GHz , RAM 1G

 pc3 : Intel core2 , 1.86GHZ, RAM 2G


 This is a very low config for HBase. I doubt you'll be able to get a
 remotely stable HBase instance going on this, more so if you are trying to
 test how much load it can take...



 ---

 hbase-env.sh
  export HBASE_MANAGES_ZK=true

 ---
 <configuration>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>true: fully-distributed with unmanaged ZooKeeper Quorum
    </description>
  </property>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://convera:9000/hbase</value>
    <description>The directory shared by region servers.
    Should be fully-qualified to include the filesystem to use.
    E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
    </description>
  </property>

  <property>
    <name>hbase.master</name>
    <value>10.42.253.182:6</value>
    <description>The host and port that the HBase master runs at.
    A value of 'local' runs the master and a regionserver in
    a single process.
    </description>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>convera</value>
    <description>Comma separated list of servers in the ZooKeeper Quorum.
    For example, host1.mydomain.com,host2.mydomain.com,host3.mydomain.com.
    By default this is set to localhost for local and pseudo-distributed modes
    of operation. For a fully-distributed setup, this should be set to a full
    list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
    this is the list of servers which we will start/stop ZooKeeper on.
    </description>
  </property>

  <property>
    <name>hbase.zookeeper.property.maxClientCnxns</name>
    <value>30</value>
    <description>Property from ZooKeeper's config zoo.cfg.
    Limit on number of concurrent connections (at the socket level) that a
    single client, identified by IP address, may make to a single member of
    the ZooKeeper ensemble. Set high to avoid zk connection issues running
    standalone and pseudo-distributed.
    </description>
  </property>

 </configuration>







 From: Amandeep Khurana ama...@gmail.com
 To: hbase-user@hadoop.apache.org
 cc: (bcc: Y_823910/TSMC)
 Subject: Re: HBase-0.20.0 multi read
 Date: 2009/08/21 11:54 AM
 Please respond to hbase-user






 You ideally want to have 3-5 servers outside the hbase servers... 1
 server is not enough. That could be causing you the trouble.

 Post logs from the master and the region server where the read failed.

 Also, what's your configuration? How many nodes, ram, cpus etc?

 On 8/20/09, y_823...@tsmc.com y_823...@tsmc.com wrote:
 
  Hi there,
 
  It worked well while I fired 5 threads to fetch data from HBase, but
  it failed after I increased to 6 threads.
  Although it only showed some WARNs, the threads' jobs can't be completed!
  My HBase is the latest version, 0.20.
  I want to test HBase multi-read performance.
  Any suggestion?
  Thank you
 
  Fleming
 

Re: HBase-0.20.0 multi read

2009-08-21 Thread Andrew Purtell
Hi,

This is far too little RAM, and underpowered on CPU as well. The rule of thumb is
1GB RAM for the system, 1GB RAM for each Hadoop daemon (HDFS, jobtracker,
tasktracker, etc.), 1GB RAM for ZooKeeper, 1GB RAM (but more if you want
performance/caching) for HBase region servers; and 1 hardware core for each
concurrent daemon process. You won't go wrong with dual quad core. If you are
running other processes colocated alongside the Hadoop and HBase daemons, you need
to account for their heap in RAM and added CPU load also. With too few resources,
swapped-out heap will make GC pauses too long, or threads will be starved for CPU,
and you'll see no end of trouble. For perspective, consider that a typical
production deployment of this system involves a dedicated 2N+1 ZooKeeper
ensemble (N ~= 1...3) and a Hadoop+HBase stack on 10s or even 100s of nodes.

   - Andy





From: y_823...@tsmc.com y_823...@tsmc.com
To: hbase-user@hadoop.apache.org
Sent: Friday, August 21, 2009 3:45:11 PM
Subject: Re: HBase-0.20.0 multi read


I have 3 PC cluster.(pc1 , pc2 , pc3)
Hadoop master (pc1), 2 slaves (pc2,pc3)

HBase and ZK running on pc1, two region servers (pc2,pc3)

pc1 : Intel core2 , 2.4GHz , RAM 1G

pc2 : Intel core2 , 2.4GHz , RAM 1G

pc3 : Intel core2 , 1.86GHZ, RAM 2G

[...]



  

Re: Story of my HBase Bugs / Feature Suggestions

2009-08-21 Thread Bradford Stephens
Sure, I've got a ton of logs. I'll try to grab what's most pertinent and put
them on rapidshare, but there will be a ton of data to sift through :)

On Thu, Aug 20, 2009 at 8:57 PM, Andrew Purtell apurt...@apache.org wrote:

 There are plans to host live region assignments in ZK and keep only an
 up-to-date copy of this state in META for use on cold boot. This is on the
 roadmap for 0.21 but perhaps could be considered for 0.20.1 also. This may
 help here.

 A TM development group saw the same behavior on a 0.19 cluster. We
 postponed looking at this because 0.20 has a significant rewrite of
 region assignment. However, it is interesting to hear such a similar
 description. I worry the underlying cause may be scanners getting stale
 data on the RS as opposed to some master problem which could be solved by
 the above, a more pervasive problem. Bradford, any chance you kept around
 logs or similar which may provide clues?

   - Andy




 
 From: Bradford Stephens bradfordsteph...@gmail.com
 To: hbase-user@hadoop.apache.org
 Sent: Friday, August 21, 2009 6:48:17 AM
 Subject: Story of my HBase Bugs / Feature Suggestions

 Hey there,

 I'm sending out this summary of how I diagnosed what was wrong with my
 cluster in hopes that you can glean some knowledge/suggestions from it :)
 Thanks for the diagnostic footwork.

 A few days ago,  I noticed that simple MR jobs I was running against
 .20-RC2
 were failing. Scanners were reaching the end of a region, and then simply
 freezing. The only indication I had of this was the Mapper timing out after
 1000 seconds -- there were no error messages in the logs for either Hadoop
 or HBase.

 It turns out that my table was corrupt:

 1. Doing a 'GET' from the shell on a row near the end of a region resulted
 in an error: Row not in expected region, or something to that effect. It
 re-appeared several times, and I never got the row content.
 2. What the Master UI indicated for the region distribution was totally
 different from what the RS reported. Row key ranges were on different
 servers than the UI knew about, and the nodes reported different start and
 end keys for a region than the UI.

 I'm not sure how this arose: I noticed after a heavy insert job that when
 we
 tried to shut down our cluster, it took 30 dots and more -- so we manually
 killed master. Would that lead to corruption?

 I finally resolved the problem by dropping the table and re-loading the
 data

 A few suggestions going forward:
 1. More useful scanner error messages: GET reported that there was a
 problem
 finding a certain row, why couldn't Scanner? There wasn't even a timeout or
 anything -- it just sat there.
 2. A fsck / restore would be useful for HBase. I imagine you can recreate
 .META. using .regioninfo and scanning blocks out of HDFS. This would play
 nice with the HBase bulk loader story, I suppose.

 I'll be happy to work on these in my spare time, if I ever get any ;)

 Cheers,
 Bradford


 --
 http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
 and Computer Science








-- 
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
and Computer Science


CFP of 3rd Hadoop in China event (Hadoop World:Beijing)

2009-08-21 Thread He Yongqiang

http://www.hadooper.cn/hadoop/cgi-bin/moin.cgi/thirdcfp

Time : Sunday, November 15, 2009
City: Beijing, China

Sponsored by Yahoo!, Cloudera
Organized by hadooper.cn

Website: http://www.hadooper.cn/hadoop/cgi-bin/moin.cgi/thirdcfp

  Sorry for the cross posting. Have a good day!

Thanks,
Yongqiang


Re: Story of my HBase Bugs / Feature Suggestions

2009-08-21 Thread Jonathan Gray

Andy,

Bradford ran his imports when there was both a Scanner bug related to
snapshotting that opened up a race condition and the nasty bugs
in getClosestBefore used to look things up in META.


It was most likely a combination of both of these things making for some 
rather nasty behavior.


JG

Andrew Purtell wrote:
There are plans to host live region assignments in ZK and keep only an up-to-date copy of this state in META for use on cold boot. This is on the roadmap for 0.21 but perhaps could be considered for 0.20.1 also. This may help here. 


A TM development group saw the same behavior on a 0.19 cluster. We
postponed looking at this because 0.20 has a significant rewrite of
region assignment. However, it is interesting to hear such a similar
description. I worry the underlying cause may be scanners getting stale data on 
the RS as opposed to some master problem which could be solved by the above, a 
more pervasive problem. Bradford, any chance you kept around logs or similar 
which may provide clues?

   - Andy





From: Bradford Stephens bradfordsteph...@gmail.com
To: hbase-user@hadoop.apache.org
Sent: Friday, August 21, 2009 6:48:17 AM
Subject: Story of my HBase Bugs / Feature Suggestions

Hey there,

I'm sending out this summary of how I diagnosed what was wrong with my
cluster in hopes that you can glean some knowledge/suggestions from it :)
Thanks for the diagnostic footwork.

A few days ago,  I noticed that simple MR jobs I was running against .20-RC2
were failing. Scanners were reaching the end of a region, and then simply
freezing. The only indication I had of this was the Mapper timing out after
1000 seconds -- there were no error messages in the logs for either Hadoop
or HBase.

It turns out that my table was corrupt:

1. Doing a 'GET' from the shell on a row near the end of a region resulted
in an error: Row not in expected region, or something to that effect. It
re-appeared several times, and I never got the row content.
2. What the Master UI indicated for the region distribution was totally
different from what the RS reported. Row key ranges were on different
servers than the UI knew about, and the nodes reported different start and
end keys for a region than the UI.

I'm not sure how this arose: I noticed after a heavy insert job that when we
tried to shut down our cluster, it took 30 dots and more -- so we manually
killed master. Would that lead to corruption?

I finally resolved the problem by dropping the table and re-loading the data

A few suggestions going forward:
1. More useful scanner error messages: GET reported that there was a problem
finding a certain row, why couldn't Scanner? There wasn't even a timeout or
anything -- it just sat there.
2. A fsck / restore would be useful for HBase. I imagine you can recreate
.META. using .regioninfo and scanning blocks out of HDFS. This would play
nice with the HBase bulk loader story, I suppose.

I'll be happy to work on these in my spare time, if I ever get any ;)

Cheers,
Bradford




Re: CFP of 3rd Hadoop in China event (Hadoop World:Beijing)

2009-08-21 Thread He Yongqiang
Hi all,
   
Please do not reply directly to this announcement email. Please send all your
messages to the secretary email address given in the CFP.

  Sorry for the cross posting. Have a good day!

Thanks,
Yongqiang

On 09-8-22, 12:21 AM, He Yongqiang heyongqi...@software.ict.ac.cn wrote:

 
 http://www.hadooper.cn/hadoop/cgi-bin/moin.cgi/thirdcfp
 
 Time : Sunday, November 15, 2009
 City: Beijing, China
 
 Sponsored by Yahoo!, Cloudera
 Organized by hadooper.cn
 
 Website: http://www.hadooper.cn/hadoop/cgi-bin/moin.cgi/thirdcfp
 
   Sorry for the cross posting. Have a good day!
 
 Thanks,
 Yongqiang



Re: HBase cluster -- 2 machines -- ZooKeeper problems when launching jobs

2009-08-21 Thread Mathias De Maré
Hi all,

I managed to solve the issue. The problem was that hbase-site.xml was not in
my classpath, so it was getting ignored.

Mathias
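
(A quick way to confirm whether hbase-site.xml is actually being picked up from
the classpath is to print what the client resolves. A minimal sketch; the
properties checked are the ones from this configuration, the class name is only
a placeholder:)

import org.apache.hadoop.hbase.HBaseConfiguration;

public class CheckHBaseConf {
  public static void main(String[] args) {
    // HBaseConfiguration loads hbase-default.xml and hbase-site.xml from the classpath.
    HBaseConfiguration conf = new HBaseConfiguration();
    // If hbase-site.xml is not on the classpath these fall back to the defaults,
    // e.g. the quorum prints "localhost" -- which is what the host=localhost:2181
    // line in the log quoted below was showing.
    System.out.println("hbase.zookeeper.quorum    = " + conf.get("hbase.zookeeper.quorum"));
    System.out.println("hbase.rootdir             = " + conf.get("hbase.rootdir"));
    System.out.println("hbase.cluster.distributed = " + conf.get("hbase.cluster.distributed"));
  }
}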

2009/8/21 Mathias De Maré mathias.dem...@gmail.com

 Hi,

 I'm setting up a small cluster with 2 machines. One is called 'master' and
 one is called 'slave'. The master is the Hadoop master. I'm running Hadoop
 0.20.0 and HBase from svn (the 0.20 branch).
 On the master, I want to run the HBase master, and on the slave, I want to
 run a regionserver and a Zookeeper instance.

 hbase-site:

 property
 namehbase.rootdir/name
 valuehdfs://master:9000/hbase/value
 descriptionThe directory shared by region servers.
 /description
 /property
 property
 namehbase.zookeeper.property.maxClientCnxns/name
 value3000/value
 /property
 property
 namehbase.hregion.max.filesize/name
 value3200/value
 /property
 property
 namehbase.cluster.distributed/name
 valuetrue/value
 /property
 property
 namehbase.zookeeper.quorum/name
 valueslave/value
 descriptionComma separated list of servers in the ZooKeeper Quorum.
 For example, host1.mydomain.com,host2.mydomain.com,host3.mydomain.com.
 By default this is set to localhost for local and pseudo-distributed modes
 of operation. For a fully-distributed setup, this should be set to a full
 list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in
 hbase-env.sh
 this is the list of servers which we will start/stop ZooKeeper on.
 /description
 /property

 Upon launching my job (on master), Zookeeper seems to crash (or something
 like it).

 I get the following output:
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client
 environment:zookeeper.version=3.2.0--1, built on 05/15/2009 06:05 GMT
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client environment:host.name
 =master
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client
 environment:java.version=1.6.0_14
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client
 environment:java.vendor=Sun Microsystems Inc.
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client
 environment:java.home=/usr/lib/jvm/java-6-sun-1.6.0.14/jre
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client
 environment:java.class.path=/root/installation/hadoop/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/root/installation/hadoop/bin/..:/root/installation/hadoop/bin/../hadoop-0.20.0-core.jar:/root/installation/hadoop/bin/../lib/commons-cli-2.0-SNAPSHOT.jar:/root/installation/hadoop/bin/../lib/commons-codec-1.3.jar:/root/installation/hadoop/bin/../lib/commons-el-1.0.jar:/root/installation/hadoop/bin/../lib/commons-httpclient-3.0.1.jar:/root/installation/hadoop/bin/../lib/commons-logging-1.0.4.jar:/root/installation/hadoop/bin/../lib/commons-logging-api-1.0.4.jar:/root/installation/hadoop/bin/../lib/commons-net-1.4.1.jar:/root/installation/hadoop/bin/../lib/core-3.1.1.jar:/root/installation/hadoop/bin/../lib/hbase-0.20.0.jar:/root/installation/hadoop/bin/../lib/heritrix-1.14.3.jar:/root/installation/hadoop/bin/../lib/hsqldb-1.8.0.10.jar:/root/installation/hadoop/bin/../lib/jasper-compiler-5.5.12.jar:/root/installation/hadoop/bin/../lib/jasper-runtime-5.5.12.jar:/root/installation/hadoop/bin/../lib/jets3t-0.6.1.jar:/root/installation/hadoop/bin/../lib/jetty-6.1.14.jar:/root/installation/hadoop/bin/../lib/jetty-util-6.1.14.jar:/root/installation/hadoop/bin/../lib/junit-3.8.1.jar:/root/installation/hadoop/bin/../lib/kfs-0.2.2.jar:/root/installation/hadoop/bin/../lib/log4j-1.2.15.jar:/root/installation/hadoop/bin/../lib/oro-2.0.8.jar:/root/installation/hadoop/bin/../lib/servlet-api-2.5-6.1.14.jar:/root/installation/hadoop/bin/../lib/slf4j-api-1.4.3.jar:/root/installation/hadoop/bin/../lib/slf4j-log4j12-1.4.3.jar:/root/installation/hadoop/bin/../lib/xmlenc-0.52.jar:/root/installation/hadoop/bin/../lib/zookeeper-r785019-hbase-1329.jar:/root/installation/hadoop/bin/../lib/jsp-2.1/jsp-2.1.jar:/root/installation/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client
 environment:java.library.path=/root/installation/hadoop/bin/../lib/native/Linux-i386-32
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client
 environment:java.io.tmpdir=/tmp
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client
 environment:java.compiler=NA
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client environment:os.name
 =Linux
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client environment:os.arch=i386
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client
 environment:os.version=2.6.24-6-xen
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client environment:user.name
 =root
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client
 environment:user.home=/root
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Client
 environment:user.dir=/root/installation/hadoop
 09/08/21 16:32:34 INFO zookeeper.ZooKeeper: Initiating client connection,
 host=localhost:2181 sessionTimeout=3
 watcher=org.apache.hadoop.hbase.client.hconnectionmanager$tableserv...@126e85f
 09/08/21 16:32:34 INFO zookeeper.ClientCnxn:
 zookeeper.disableAutoWatchReset is false
 

Re: Tip when migrating your data loading MR jobs from 0.19 to 0.20

2009-08-21 Thread Jonathan Gray
Honestly I'm not completely familiar with the implementation of 
DFSClient and the buffers.


What I can say is that the contract to HDFS via DFSClient for an 
appended file is that a sync() or sync()-like method can be called, and 
when it returns, it ensures that if the client dies, all data up to the 
sync() will be available to any reader.


Schubert Zhang wrote:

@JG:

Regarding the write-buffer, I mean the DFSClient-side buffer. In the current
version of HDFS, I found the buffer (bytesPerChecksum) on the client side. The
written data will be flushed to the datanode when the buffer is full. The HBase RS
is a client of HDFS.
@JD:

you wrote:
But, in many cases, a RS crash still means that you must restart your job
because log splitting can take more than 10 minutes so many tasks times out
(I am currently working on that for 0.21 to make it really faster btw).

I think this (tasks timing out) is just a temporary or uncertain phenomenon.
For example, take a map-only MapReduce job where each map task puts rows
into the same HBase table. When one RS crashes, maybe only one map task fails,
and the failed map task will be relaunched. So I am considering calling
HBaseAdmin.flush() when each map task completes. But too many
HBaseAdmin.flush() calls will cause too many small HStoreFiles and then too many
compactions. If we call HBaseAdmin.flush() when the job completes, we must
ensure the RS does not crash before that.


On Thu, Aug 20, 2009 at 1:24 AM, Jonathan Gray jl...@streamy.com wrote:


Are you referring to the actual disk or raid card write buffer?  If so,
then yes a single node could lose data if you don't have a raid card with a
BBU, but remember that this ain't no RDBMS, leave your raid cards and
batteries at home!

HDFS append does not just append to a single node, it ensures that the
append is replicated just like every other operation.  So (on default of 3)
you would have to lose 3 nodes at the same instant, which is the
impossible paradigm we are working with by using a rep factor of 3.

JG


Schubert Zhang wrote:


Thank you J-D, it's a good post.
I have tested the performance of put.setWriteToWAL(false); it is really fast.
And after the batch loading, we should call admin.flush(tablename) to flush
all data to HDFS.

One more question, about HDFS supporting append so the logs can work in
append mode: I think HDFS will still have a write-buffer in the future. Then
if the server crashes while the appends in the write-buffer have not yet been
written to disk, will the data in the write-buffer be lost?


On Wed, Aug 12, 2009 at 1:40 AM, Jean-Daniel Cryans jdcry...@apache.org

wrote:

Hi users,

This weekend at the second HBase Hackathon (held by StumbleUpon, thx!)
we helped someone migrating a data loading MapReduce job from 0.19 to
0.20 because of a performance problem. It was something like 20x
slower.

How we solved it, short answer:
After instantiating the Put that you give to the TableOutputFormat, do
put.writeToWAL(false).

Long answer:
As you may know, HDFS still does not support appends. That means that
the write ahead logs or WAL that HBase uses are only helpful if synced
on disk. That means that you lose some data during a region server
crash or a kill -9. In 0.19 the logs could be opened forever if they
had under 10 edits. Now in 0.20 we fixed that by capping the WAL
to ~62MB and we also rotate the logs after 1 hour. This is all good
because it means far less data loss until we are able to append to
files in HDFS.

Now to why this may slow down your import: the job I was talking about
had huge rows, so the logs got rotated much more often, whereas in 0.19
only the number of rows triggered a log rotation. Not writing to the
WAL has the advantage of using far less disk IO but, as you can guess,
it means huge data loss in the case of a region server crash. But, in
many cases, a RS crash still means that you must restart your job
because log splitting can take more than 10 minutes, so many tasks
time out (I am currently working on that for 0.21 to make it really
faster btw).

Hope this helps someone,

J-D
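
(Putting the pieces of this thread together, a minimal sketch of a loading
client with the WAL disabled and a flush at the end. The table, family and row
names are placeholders, and the setter shows up in this thread as both
writeToWAL() and setWriteToWAL(), so check the Put javadoc for your exact
version:)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BulkLoadSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "mytable");      // placeholder table name
    table.setAutoFlush(false);                       // batch puts on the client side

    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("fam"), Bytes.toBytes("col"), Bytes.toBytes("value-" + i));
      put.setWriteToWAL(false);  // skip the write-ahead log: much less disk IO,
                                 // but unflushed data is lost if the RS dies
      table.put(put);
    }
    table.flushCommits();        // push the client-side buffer to the region servers

    // Force the memstores out to HDFS once the whole load is done (Schubert's
    // point above), so a later RS crash cannot lose the freshly loaded rows.
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.flush("mytable");
  }
}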






Re: Doubt in HBase

2009-08-21 Thread Jonathan Gray

Ryan,

In older versions of HBase, when we did not attempt any data locality,
we had a few users running jobs that became network i/o bound. It
wasn't a latency issue; it was a bandwidth issue.


That's actually when/why an attempt at better data locality for HBase MR 
was made in the first place.


I hadn't personally experienced it but I recall two users who had. 
After they made a first-stab patch, I ran some comparisons and noticed a 
significant reduction in network i/o for data-intensive MR jobs.  They 
also were no longer network i/o bound on their jobs, if I recall, and 
became disk i/o bound (as one would expect/hope).


For a majority of use cases, it doesn't matter in a significant way at 
all.  But I have seen it make a measurable difference for some.


JG

bharath vissapragada wrote:

Thanks Ryan

I was just explaining with an example .. I have TBs of data to work
with. I just wanted to know whether the scheduler TRIES to assign the reduce
phase to keep the data local (i.e., TRYING to assign it to the machine with
the greater number of key values).
I was just explaining it with an example.

Thanks for ur reply (following u on twitter :))

On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com wrote:


hey,

Yes the hadoop system attempts to assign map tasks to data local, but
why would you be worried about this for 5 values?  The max value size
in hbase is Integer.MAX_VALUE, so it's not like you have much data to
shuffle. Once your blobs > ~64mb or so, it might make more sense to
use HDFS directly and keep only the metadata in hbase (including
things like location of the data blob).

I think people are confused about how optimal map reduces have to be.
Keeping all the data super-local on each machine is not always helping
you, since you have to read via a socket anyways. Going remote doesn't
actually make things that much slower, since on a modern lan ping
times are < 0.1ms.  If your entire cluster is hanging off a single
switch, there is nearly unlimited bandwidth between all nodes
(certainly much higher than any single system could push).  Only once
you go multi-switch then switch-locality (aka rack locality) becomes
important.

Remember, hadoop isn't about the instantaneous speed of any job, but
about running jobs in a highly scalable manner that works on tens or
tens of thousands of nodes. You end up blocking on single machine
limits anyways, and the r=3 of HDFS helps you transcend a single
machine read speed for large files. Keeping the data transfer local in
this case results in lower performance.

If you want max local speed, I suggest looking at CUDA.


On Thu, Aug 20, 2009 at 9:09 PM, bharath
vissapragadabharathvissapragada1...@gmail.com wrote:

Aamandeep , Gray and Purtell thanks for your replies .. I have found them
very useful.

You said to increase the number of reduce tasks . Suppose the number of
reduce tasks is more than number of distinct map output keys , some of

the

reduce processes may go waste ? is that the case?

Also  I have one more doubt ..I have 5 values for a corresponding key on

one

region  and other 2 values on 2 different region servers.
Does hadoop Map reduce take care of moving these 2 diff values to the

region

with 5 values instead of moving those 5 values to other system to

minimize

the dataflow? Is this what is happening inside ?

On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org

wrote:

The behavior of TableInputFormat is to schedule one mapper for every

table

region.

In addition to what others have said already, if your reducer is doing
little more than storing data back into HBase (via TableOutputFormat),

then

you can consider writing results back to HBase directly from the mapper

to

avoid incurring the overhead of sort/shuffle/merge which happens within

the

Hadoop job framework as map outputs are input into reducers. For that

type

of use case -- using the Hadoop mapreduce subsystem as essentially a

grid

scheduler -- something like job.setNumReducers(0) will do the trick.

Best regards,

  - Andy





From: john smith js1987.sm...@gmail.com
To: hbase-user@hadoop.apache.org
Sent: Friday, August 21, 2009 12:42:36 AM
Subject: Doubt in HBase

Hi all ,

I have one small doubt . Kindly answer it even if it sounds silly.

Iam using Map Reduce in HBase in distributed mode .  I have a table

which

spans across 5 region servers . I am using TableInputFormat to read the
data
from the tables in the map . When i run the program , by default how

many

map regions are created ? Is it one per region server or more ?

Also after the map task is over.. reduce task is taking a bit more time

.

Is
it due to moving the map output across the regionservers? i.e, moving

the

values of same key to a particular reduce phase to start the reducer? Is
there any way i can optimize the code (e.g. by storing data of same

reducer

nearby )

Thanks :)








MapReduce and Hbase - info FIVE

2009-08-21 Thread Taylor, Ronald C
 

-Original Message-
From: bharath vissapragada [mailto:bharathvissapragada1...@gmail.com] 
Sent: Friday, August 21, 2009 12:14 AM
To: hbase-user@hadoop.apache.org
Subject: Re: Doubt in HBase

Thanks Ryan

I was just explaining with an example .. I have TBs of data to work
with. I just wanted to know whether the scheduler TRIES to assign the reduce
phase to keep the data local (i.e., TRYING to assign it to the machine with
the greater number of key values).
I was just explaining it with an example.

Thanks for ur reply (following u on twitter :))

On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com
wrote:

 hey,

 Yes the hadoop system attempts to assign map tasks to data local, but 
 why would you be worried about this for 5 values?  The max value size 
 in hbase is Integer.MAX_VALUE, so it's not like you have much data to 
 shuffle. Once your blobs > ~64mb or so, it might make more sense to 
 use HDFS directly and keep only the metadata in hbase (including 
 things like location of the data blob).

 I think people are confused about how optimal map reduces have to be.
 Keeping all the data super-local on each machine is not always helping

 you, since you have to read via a socket anyways. Going remote doesn't

 actually make things that much slower, since on a modern lan ping 
 times are < 0.1ms.  If your entire cluster is hanging off a single 
 switch, there is nearly unlimited bandwidth between all nodes 
 (certainly much higher than any single system could push).  Only once 
 you go multi-switch then switch-locality (aka rack locality) becomes 
 important.

 Remember, hadoop isn't about the instantaneous speed of any job, but 
 about running jobs in a highly scalable manner that works on tens or 
 tens of thousands of nodes. You end up blocking on single machine 
 limits anyways, and the r=3 of HDFS helps you transcend a single 
 machine read speed for large files. Keeping the data transfer local in

 this case results in lower performance.

 If you want max local speed, I suggest looking at CUDA.


 On Thu, Aug 20, 2009 at 9:09 PM, bharath 
 vissapragadabharathvissapragada1...@gmail.com wrote:
  Aamandeep , Gray and Purtell thanks for your replies .. I have found

  them very useful.
 
  You said to increase the number of reduce tasks . Suppose the number

  of reduce tasks is more than number of distinct map output keys , 
  some of
 the
  reduce processes may go waste ? is that the case?
 
  Also  I have one more doubt ..I have 5 values for a corresponding 
  key on
 one
  region  and other 2 values on 2 different region servers.
  Does hadoop Map reduce take care of moving these 2 diff values to 
  the
 region
  with 5 values instead of moving those 5 values to other system to
 minimize
  the dataflow? Is this what is happening inside ?
 
  On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell 
  apurt...@apache.org
 wrote:
 
  The behavior of TableInputFormat is to schedule one mapper for 
  every
 table
  region.
 
  In addition to what others have said already, if your reducer is 
  doing little more than storing data back into HBase (via 
  TableOutputFormat),
 then
  you can consider writing results back to HBase directly from the 
  mapper
 to
  avoid incurring the overhead of sort/shuffle/merge which happens 
  within
 the
  Hadoop job framework as map outputs are input into reducers. For 
  that
 type
  of use case -- using the Hadoop mapreduce subsystem as essentially 
  a
 grid
  scheduler -- something like job.setNumReducers(0) will do the
trick.
 
  Best regards,
 
- Andy
 
 
 
 
  
  From: john smith js1987.sm...@gmail.com
  To: hbase-user@hadoop.apache.org
  Sent: Friday, August 21, 2009 12:42:36 AM
  Subject: Doubt in HBase
 
  Hi all ,
 
  I have one small doubt . Kindly answer it even if it sounds silly.
 
  Iam using Map Reduce in HBase in distributed mode .  I have a table
 which
  spans across 5 region servers . I am using TableInputFormat to read

  the data from the tables in the map . When i run the program , by 
  default how
 many
  map regions are created ? Is it one per region server or more ?
 
  Also after the map task is over.. reduce task is taking a bit more 
  time
 .
  Is
  it due to moving the map output across the regionservers? i.e, 
  moving
 the
  values of same key to a particular reduce phase to start the 
  reducer? Is there any way i can optimize the code (e.g. by storing 
  data of same
 reducer
  nearby )
 
  Thanks :)
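
(For reference, the map-only pattern discussed above -- reading a table through
TableInputFormat and writing Puts straight back through TableOutputFormat with
zero reducers -- looks roughly like the sketch below against the 0.20
org.apache.hadoop.hbase.mapreduce API. The table names and job name are
placeholders; double-check the helper signatures against your version's
javadoc.)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyCopy {
  // One mapper runs per region of the source table; every Put emitted here goes
  // straight to TableOutputFormat because the job runs with zero reducers.
  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(row.get());
      // Copy every cell of the source row; a real job would compute something here.
      for (KeyValue kv : value.list()) {
        put.add(kv.getFamily(), kv.getQualifier(), kv.getValue());
      }
      context.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "map-only copy");        // job name is a placeholder
    job.setJarByClass(MapOnlyCopy.class);
    TableMapReduceUtil.initTableMapperJob("source_table", new Scan(),
        CopyMapper.class, ImmutableBytesWritable.class, Put.class, job);
    // Sets TableOutputFormat for the target table; with zero reduce tasks the map
    // output is handed directly to the output format, so there is no sort/shuffle/merge.
    TableMapReduceUtil.initTableReducerJob("target_table", IdentityTableReducer.class, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}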
 
 
 
 
 



Re: Doubt in HBase

2009-08-21 Thread bharath vissapragada
JG

Can you please elaborate on that last statement ("for some") by giving an
example or some kind of scenario in which it can take place where MR jobs
involve a huge amount of data?

Thanks.

On Fri, Aug 21, 2009 at 11:24 PM, Jonathan Gray jl...@streamy.com wrote:

 Ryan,

 In older versions of HBase, when we did not attempt any data locality, we
 had a few users running jobs that became network i/o bound.  It wasn't a
 latency issue it was a bandwidth issue.

 That's actually when/why an attempt at better data locality for HBase MR
 was made in the first place.

 I hadn't personally experienced it but I recall two users who had. After
 they made a first-stab patch, I ran some comparisons and noticed a
 significant reduction in network i/o for data-intensive MR jobs.  They also
 were no longer network i/o bound on their jobs, if I recall, and became disk
 i/o bound (as one would expect/hope).

 For a majority of use cases, it doesn't matter in a significant way at all.
  But I have seen it make a measurable difference for some.

 JG


 bharath vissapragada wrote:

 Thanks Ryan

 I was just explaining with an example .. I have TBs of data to work
 with.Just i wanted to know that scheduler TRIES to assign the reduce phase
 to keep the data local (i.e.,TRYING  to assign it to the machine with
 machine with greater num of key values).
 I was just explaining it with an example .

 Thanks for ur reply (following u on twitter :))

 On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com wrote:

  hey,

 Yes the hadoop system attempts to assign map tasks to data local, but
 why would you be worried about this for 5 values?  The max value size
 in hbase is Integer.MAX_VALUE, so it's not like you have much data to
 shuffle. Once your blobs > ~64mb or so, it might make more sense to
 use HDFS directly and keep only the metadata in hbase (including
 things like location of the data blob).

 I think people are confused about how optimal map reduces have to be.
 Keeping all the data super-local on each machine is not always helping
 you, since you have to read via a socket anyways. Going remote doesn't
 actually make things that much slower, since on a modern lan ping
 times are < 0.1ms.  If your entire cluster is hanging off a single
 switch, there is nearly unlimited bandwidth between all nodes
 (certainly much higher than any single system could push).  Only once
 you go multi-switch then switch-locality (aka rack locality) becomes
 important.

 Remember, hadoop isn't about the instantaneous speed of any job, but
 about running jobs in a highly scalable manner that works on tens or
 tens of thousands of nodes. You end up blocking on single machine
 limits anyways, and the r=3 of HDFS helps you transcend a single
 machine read speed for large files. Keeping the data transfer local in
 this case results in lower performance.

 If you want max local speed, I suggest looking at CUDA.


 On Thu, Aug 20, 2009 at 9:09 PM, bharath
 vissapragadabharathvissapragada1...@gmail.com wrote:

 Aamandeep , Gray and Purtell thanks for your replies .. I have found
 them
 very useful.

 You said to increase the number of reduce tasks . Suppose the number of
 reduce tasks is more than number of distinct map output keys , some of

 the

 reduce processes may go waste ? is that the case?

 Also  I have one more doubt ..I have 5 values for a corresponding key on

 one

 region  and other 2 values on 2 different region servers.
 Does hadoop Map reduce take care of moving these 2 diff values to the

 region

 with 5 values instead of moving those 5 values to other system to

 minimize

 the dataflow? Is this what is happening inside ?

 On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org

 wrote:

 The behavior of TableInputFormat is to schedule one mapper for every

 table

 region.

 In addition to what others have said already, if your reducer is doing
 little more than storing data back into HBase (via TableOutputFormat),

 then

 you can consider writing results back to HBase directly from the mapper

 to

 avoid incurring the overhead of sort/shuffle/merge which happens within

 the

 Hadoop job framework as map outputs are input into reducers. For that

 type

 of use case -- using the Hadoop mapreduce subsystem as essentially a

 grid

 scheduler -- something like job.setNumReducers(0) will do the trick.

 Best regards,

  - Andy




 
 From: john smith js1987.sm...@gmail.com
 To: hbase-user@hadoop.apache.org
 Sent: Friday, August 21, 2009 12:42:36 AM
 Subject: Doubt in HBase

 Hi all ,

 I have one small doubt . Kindly answer it even if it sounds silly.

 Iam using Map Reduce in HBase in distributed mode .  I have a table

 which

 spans across 5 region servers . I am using TableInputFormat to read the
 data
 from the tables in the map . When i run the program , by default how

 many

 map regions are created ? Is it one per region server or more ?

 Also after the map task is 

Re: Doubt in HBase

2009-08-21 Thread Jonathan Gray

I really couldn't be specific.

The more data that has to be moved across the wire, the more network i/o.

For example, if you have very large values and a very large table, and
you have that as the input to your MR, you could potentially be network
i/o bound.


It should be very easy to test how your own jobs run on your own cluster 
using Ganglia and hadoop/mr logging/output.


bharath vissapragada wrote:

JG

Can you please elaborate on the last statement for some.. by giving an
example or some kind of scenario in which it can take place where MR jobs
involve huge amount of data.

Thanks.

On Fri, Aug 21, 2009 at 11:24 PM, Jonathan Gray jl...@streamy.com wrote:


Ryan,

In older versions of HBase, when we did not attempt any data locality, we
had a few users running jobs that became network i/o bound.  It wasn't a
latency issue it was a bandwidth issue.

That's actually when/why an attempt at better data locality for HBase MR
was made in the first place.

I hadn't personally experienced it but I recall two users who had. After
they made a first-stab patch, I ran some comparisons and noticed a
significant reduction in network i/o for data-intensive MR jobs.  They also
were no longer network i/o bound on their jobs, if I recall, and became disk
i/o bound (as one would expect/hope).

For a majority of use cases, it doesn't matter in a significant way at all.
 But I have seen it make a measurable difference for some.

JG


bharath vissapragada wrote:


Thanks Ryan

I was just explaining with an example .. I have TBs of data to work
with.Just i wanted to know that scheduler TRIES to assign the reduce phase
to keep the data local (i.e.,TRYING  to assign it to the machine with
machine with greater num of key values).
I was just explaining it with an example .

Thanks for ur reply (following u on twitter :))

On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com wrote:

 hey,

Yes the hadoop system attempts to assign map tasks to data local, but
why would you be worried about this for 5 values?  The max value size
in hbase is Integer.MAX_VALUE, so it's not like you have much data to
shuffle. Once your blobs > ~64mb or so, it might make more sense to
use HDFS directly and keep only the metadata in hbase (including
things like location of the data blob).

I think people are confused about how optimal map reduces have to be.
Keeping all the data super-local on each machine is not always helping
you, since you have to read via a socket anyways. Going remote doesn't
actually make things that much slower, since on a modern lan ping
times are < 0.1ms.  If your entire cluster is hanging off a single
switch, there is nearly unlimited bandwidth between all nodes
(certainly much higher than any single system could push).  Only once
you go multi-switch then switch-locality (aka rack locality) becomes
important.

Remember, hadoop isn't about the instantaneous speed of any job, but
about running jobs in a highly scalable manner that works on tens or
tens of thousands of nodes. You end up blocking on single machine
limits anyways, and the r=3 of HDFS helps you transcend a single
machine read speed for large files. Keeping the data transfer local in
this case results in lower performance.

If you want max local speed, I suggest looking at CUDA.


On Thu, Aug 20, 2009 at 9:09 PM, bharath
vissapragadabharathvissapragada1...@gmail.com wrote:


Aamandeep , Gray and Purtell thanks for your replies .. I have found
them
very useful.

You said to increase the number of reduce tasks . Suppose the number of
reduce tasks is more than number of distinct map output keys , some of


the


reduce processes may go waste ? is that the case?

Also  I have one more doubt ..I have 5 values for a corresponding key on


one


region  and other 2 values on 2 different region servers.
Does hadoop Map reduce take care of moving these 2 diff values to the


region


with 5 values instead of moving those 5 values to other system to


minimize


the dataflow? Is this what is happening inside ?

On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org


wrote:


The behavior of TableInputFormat is to schedule one mapper for every
table
region.

In addition to what others have said already, if your reducer is doing
little more than storing data back into HBase (via TableOutputFormat),


then
you can consider writing results back to HBase directly from the mapper
to
avoid incurring the overhead of sort/shuffle/merge which happens within
the
Hadoop job framework as map outputs are input into reducers. For that
type
of use case -- using the Hadoop mapreduce subsystem as essentially a
grid
scheduler -- something like job.setNumReducers(0) will do the trick.

Best regards,

 - Andy





From: john smith js1987.sm...@gmail.com
To: hbase-user@hadoop.apache.org
Sent: Friday, August 21, 2009 12:42:36 AM
Subject: Doubt in HBase

Hi all ,

I have one small doubt . Kindly answer it even if it sounds silly.


Re: Many 2 one in a row - modeling options

2009-08-21 Thread tim robertson
Just to keep this with the rest of the thread as this might be useful
to others using HBase.

I got a quick chance to test protobuf
(http://code.google.com/p/protobuf/) on my MacBook Pro and found
that it can serialize and deserialize a Collection of objects, each
with 9 Strings, at a rate of 74 serialize-and-deserialize round trips
per millisecond.  Deserialization alone ran at a rate of 146 per
millisecond.

Thanks for pointing to protobuf Jonathan - seems it is well placed for
use in HBase where complex types are needed (e.g. my many2one in a
single row)

Tim
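
(To make the single-row layout discussed above concrete: an "attributes" family
used as a key/value dictionary plus an "ids" family whose values are serialized
identifications. A minimal sketch -- the table, family, qualifier and attribute
names are all placeholders, and serializeIdentification() just stands in for
whatever protobuf message you end up defining:)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OccurrenceRowSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "occurrence");  // placeholder

    byte[] guid = Bytes.toBytes("urn:example:guid:1234");   // row key = the record's GUID
    Put put = new Put(guid);

    // "attributes" family as a key/value dictionary: qualifier = attribute name.
    put.add(Bytes.toBytes("attributes"), Bytes.toBytes("InstCode"), Bytes.toBytes("MNHA"));
    put.add(Bytes.toBytes("attributes"), Bytes.toBytes("CatNum"), Bytes.toBytes("12345"));

    // "ids" family: one column per identification, qualifier = identification id,
    // value = the identification serialized to bytes (e.g. a protobuf message).
    byte[] identification = serializeIdentification("Puma concolor", "2008");
    put.add(Bytes.toBytes("ids"), Bytes.toBytes("id-1"), identification);

    table.put(put);
  }

  // Stand-in for protobuf serialization; a generated message would be built with
  // Identification.newBuilder()...build() and serialized with toByteArray().
  private static byte[] serializeIdentification(String name, String year) {
    return Bytes.toBytes(name + "|" + year);
  }
}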


On Wed, Aug 19, 2009 at 10:14 PM, tim
robertsontimrobertson...@gmail.com wrote:
 Thanks JG for taking the time to digest that and comment.

 It was a hastily written page as I am on vacation... I'm heading over
 to the bay area for a few days and wanted to start getting something
 together to discuss with Stack and anyone else free on Tuesday 8th.  I
 hope it develops into a good case study for the HBase community.

 In terms of operations, I think it will boil down to 3 things:
  a) full scans building search indexes (probably lucene)

 b) scanning and annotating same row (such as a field for a named
 national park the point falls in)
    -  in terms of the scientific identification there will be a fair
 amount of scanning the identification and then annotating a new column
 with an ID for the equivalent of an external taxonomy.  One example
 might be that you want to browse the occurrence store (specimens and
 observations) using the Catalogue of Life taxonomy, and this record
 would be found with the identifier of
 http://www.catalogueoflife.org/annual-checklist/2008/show_species_details.php?record_id=5204463.
  It is not as simple as a name match as the synonymy is subject to
 differing opinion.

 c) scans with filters.  I expect that most of these raw values will be
 parsed out to better typed values (hash encoded lat long etc) and the
 filters would be on those families and not these raw families.  I
 think the identifications would be parsed to ID's of well known
 taxonomies, and the filters would be using those values.

 I was expecting serializing to be the most likely choice, and I'll
 start checking out the protobufs stuff - I have been writing my own
 serializations based on Writable for storing values in lucene indexes
 recently.

 I'll clean up the wiki and probably have more questions,

 Cheers,

 Tim



 On Wed, Aug 19, 2009 at 7:20 PM, Jonathan Grayjl...@streamy.com wrote:
 Tim,

 Very cool wiki page.  Unfortunately I'm a little confused about exactly what
 the requirements are.

 Does each species (and the combination of all of its identifications)
 actually have a single, unique ID?

 The most important thing when designing your HBase schema is to understand
 how you want to query it.  And I'm not exactly sure I follow that part.

 I'm going to assume that there is a single, relatively static set of
 attributes for each unique ID (the GUID, Cat#, etc).  Let's put that in a
 family, call it attributes.  You would use that family as a key/value
 dictionary.  The qualifier would be the attribute name, and the value would
 be the attribute value (ie. attributes:InstCode with value MNHA).

 The row, in this case, would be the GUID or whatever unique ID you want to
 lookup by.

 Now the other part, storing the identifications.  I would definitely vote
 against multiples rows, multiple tables, and multiple families.  As you
 point out, multiple tables would require joining, multiple families does in
 fact mean 2 separate files, and multiple rows adds a great deal of
 complexity (you need to Scan now, cannot rely on Get).

 So let's say we have a family identifications (though you may want to
 shorten these family names as they are actually stored explicitly for every
 single cell... maybe ids).  For each identification, you would have a
 single column.  The qualifier of that column would be whatever the unique
 identifier is for that identification, or if there isn't one, you could just
 wrap up the entire thing in to a serialized type and use that as the
 qualifier.  If you have an ID, then I would serialize the identification
 into the value.

 You point out that this would have poor scanning performance because of the
 need for deserialization, but I don't necessarily agree.  That can be quite
 fast, depending on implementation, and there's a great deal of
 serialization/deserialization being done behind the scenes to even get the
 data to you in the first place.

 Something like protobufs has very efficient and fast serialize/deserialize
 operations.  Java serialization is inefficient in space and can be slow,
 which is why HBase and Hadoop implement the Writable interface and provide a
 minimal/efficient/binary serialization.

 I do think that is the by far the best approach here, the
 serialization/deserialization should be orders of magnitude faster than
 round-trip network latency.

 I didn't realize your first bullet was 

Re: Many 2 one in a row - modeling options

2009-08-21 Thread Jonathan Gray

Excellent.  Nice to have some official numbers.

Thanks Tim.

JG

tim robertson wrote:

Just to keep this with the rest of the thread as this might be useful
to others using HBase.

I got a quick chance to test protobuf
(http://code.google.com/p/protobuf/) and on my Macbook pro and found
that it can serialize and deserialize a Collection of Objects, each
with 9 Strings, at a rate of 74 serializations and deserializations
per millisecond.  Deserialization alone was at a rate of 146 per
millisecond.

Thanks for pointing to protobuf, Jonathan - it seems well placed for
use in HBase where complex types are needed (e.g. my many-to-one in a
single row)
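
For anyone wanting to reproduce a rough figure like that, the loop being
timed is essentially a serialize/deserialize round trip. A sketch (not the
actual benchmark code) is below, assuming a hypothetical
OccurrenceProtos.Identification message already generated by protoc from a
.proto with a few string fields; the message and field names are made up:

import com.google.protobuf.InvalidProtocolBufferException;

public class ProtobufRoundTripTiming {
  public static void main(String[] args) throws InvalidProtocolBufferException {
    final int rounds = 100000;
    long totalBytes = 0;   // keep a side effect so the loop isn't optimized away

    long start = System.nanoTime();
    for (int i = 0; i < rounds; i++) {
      // Hypothetical protoc-generated class with the usual builder API
      OccurrenceProtos.Identification id = OccurrenceProtos.Identification.newBuilder()
          .setScientificName("Puma concolor")
          .setIdentifiedBy("T. Robertson")
          .setDateIdentified("2009-08-21")
          .build();
      byte[] bytes = id.toByteArray();                              // serialize
      OccurrenceProtos.Identification copy =
          OccurrenceProtos.Identification.parseFrom(bytes);         // deserialize
      totalBytes += copy.getSerializedSize();
    }
    double ms = (System.nanoTime() - start) / 1000000.0;
    System.out.println((rounds / ms) + " round trips per millisecond (" + totalBytes + " bytes)");
  }
}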

Tim


On Wed, Aug 19, 2009 at 10:14 PM, tim
robertsontimrobertson...@gmail.com wrote:

Thanks JG for taking the time to digest that and comment.

It was a hastily written page as I am on vacation... I'm heading over
to the bay area for a few days and wanted to start getting something
together to discuss with Stack and anyone else free on Tuesday 8th.  I
hope it develops into a good case study for the HBase community.

In terms of operations, I think it will boil down to 3 things:
 a) full scans building search indexes (probably lucene)

b) scanning and annotating the same row (such as adding a field for the
named national park a point falls in)
   -  in terms of the scientific identification, there will be a fair
amount of scanning the identifications and then annotating a new column
with an ID for the equivalent record in an external taxonomy.  One example
might be that you want to browse the occurrence store (specimens and
observations) using the Catalogue of Life taxonomy, and this record
would be found with the identifier of
http://www.catalogueoflife.org/annual-checklist/2008/show_species_details.php?record_id=5204463.
 It is not as simple as a name match as the synonymy is subject to
differing opinion.

c) scans with filters.  I expect that most of these raw values will be
parsed out to better-typed values (hash-encoded lat/long etc.) and the
filters would run on those families and not on these raw families.  I
think the identifications would be parsed to IDs of well-known
taxonomies, and the filters would use those values (a sketch of (b) and
(c) follows below).
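
A sketch of (b) and (c) against the 0.20 client API; the family and
qualifier names ("ids", "annotations", "col2008", "sciName") and the
resolveCatalogueOfLifeId() lookup are placeholders for illustration:

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class AnnotateAndFilter {

  // (b) scan the raw identifications and write a parsed taxonomy ID back to the same row
  static void annotate(HTable table) throws Exception {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("ids"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result row : scanner) {
      byte[] rawName = row.getValue(Bytes.toBytes("ids"), Bytes.toBytes("sciName"));
      String colId = resolveCatalogueOfLifeId(Bytes.toString(rawName));  // hypothetical lookup
      if (colId != null) {
        Put put = new Put(row.getRow());
        put.add(Bytes.toBytes("annotations"), Bytes.toBytes("col2008"), Bytes.toBytes(colId));
        table.put(put);
      }
    }
    scanner.close();
  }

  // (c) later scans filter on the parsed, well-typed column rather than on the raw values
  static ResultScanner byCatalogueOfLifeId(HTable table, String colId) throws Exception {
    Scan scan = new Scan();
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("annotations"), Bytes.toBytes("col2008"),
        CompareOp.EQUAL, Bytes.toBytes(colId)));
    return table.getScanner(scan);
  }

  private static String resolveCatalogueOfLifeId(String scientificName) {
    return null;  // placeholder; real code would consult the external taxonomy
  }
}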

I was expecting serializing to be the most likely choice, and I'll
start checking out the protobufs stuff - I have been writing my own
serializations based on Writable for storing values in lucene indexes
recently.

I'll clean up the wiki and probably have more questions,

Cheers,

Tim



On Wed, Aug 19, 2009 at 7:20 PM, Jonathan Grayjl...@streamy.com wrote:

Tim,

Very cool wiki page.  Unfortunately I'm a little confused about exactly what
the requirements are.

Does each species (and the combination of all of its identifications)
actually have a single, unique ID?

The most important thing when designing your HBase schema is to understand
how you want to query it.  And I'm not exactly sure I follow that part.

I'm going to assume that there is a single, relatively static set of
attributes for each unique ID (the GUID, Cat#, etc).  Let's put that in a
family, call it attributes.  You would use that family as a key/value
dictionary.  The qualifier would be the attribute name, and the value would
be the attribute value (ie. attributes:InstCode with value MNHA).

The row, in this case, would be the GUID or whatever unique ID you want to
lookup by.

Now the other part, storing the identifications.  I would definitely vote
against multiple rows, multiple tables, and multiple families.  As you
point out, multiple tables would require joining, multiple families does in
fact mean 2 separate files, and multiple rows adds a great deal of
complexity (you need to Scan now, cannot rely on Get).

So let's say we have a family identifications (though you may want to
shorten these family names as they are actually stored explicitly for every
single cell... maybe ids).  For each identification, you would have a
single column.  The qualifier of that column would be whatever the unique
identifier is for that identification, or if there isn't one, you could just
wrap up the entire thing into a serialized type and use that as the
qualifier.  If you have an ID, then I would serialize the identification
into the value.

You point out that this would have poor scanning performance because of the
need for deserialization, but I don't necessarily agree.  That can be quite
fast, depending on implementation, and there's a great deal of
serialization/deserialization being done behind the scenes to even get the
data to you in the first place.

Something like protobufs has very efficient and fast serialize/deserialize
operations.  Java serialization is inefficient in space and can be slow,
which is why HBase and Hadoop implement the Writable interface and provide a
minimal/efficient/binary serialization.

I do think that is by far the best approach here; the
serialization/deserialization should be orders of magnitude faster than
round-trip network latency.

I didn't realize 

Re: Location of HBase's database (database' s files) on the hard disk

2009-08-21 Thread Nguyen Thi Ngoc Huong
OK, this is my error. I didn't stop HBase and Hadoop when restarting my
computer.
Thank you very much.

2009/8/21 Jean-Daniel Cryans jdcry...@apache.org

 2 questions:

 Which version of HBase are you using?

 Are you stopping HBase and Hadoop when restarting your computer?

 Thx,

 J-D

 On Fri, Aug 21, 2009 at 5:46 AM, Nguyen Thi Ngoc
 Huonghuongn...@gmail.com wrote:
  Thank you very much.
  I deleted everything and configured the hadoop.tmp.dir property in
  hadoop-site.xml as follows:
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/huongntn/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
 
  After that, I formatted the namenode and ran start-all. When I restarted
  my computer and typed the start-all command, Hadoop worked smoothly. I
  started HBase with ./bin/start-hbase.sh and ./hbase shell.

  Now I can't see my database in the hbase shell (using the list command),
  although I can see it in the Hadoop site manager.
 
 
  2009/8/21 Amandeep Khurana ama...@gmail.com
 
  1. If you have formatted your namenode before starting the first time,
  that's all that's needed.

  2. To start from scratch, delete everything that's in the directory your
  HDFS points to; format the namenode again; start all.

  3. If it still doesn't work, look at the namenode logs to see what's
  happening. Post it here if you can't figure it out.
 
 
  Amandeep Khurana
  Computer Science Graduate Student
  University of California, Santa Cruz
 
 
  On Fri, Aug 21, 2009 at 1:30 AM, Nguyen Thi Ngoc Huong
  huongn...@gmail.comwrote:
 
   You don't need to format the namenode every time.. just
   bin/start-all.sh

   Really? With just bin/start-all.sh, the namenode is not started (when I
   type the jps command, there are only these 5 processes:
   3421 SecondaryNameNode
   3492 JobTracker
   3582 TaskTracker
   4031 Jps
   3325 DataNode; there is no NameNode process)
   and certainly the page http://localhost:50070 is dead, and the connection
   from HBase to Hadoop is dead, too.
  
  
   2009/8/21 Amandeep Khurana ama...@gmail.com
  
On Fri, Aug 21, 2009 at 1:03 AM, Nguyen Thi Ngoc Huong
huongn...@gmail.comwrote:
   
 Thank you very much. I edited hbase-site.xml as follows:

 <property>
   <name>hbase.rootdir</name>
   <value>hdfs://localhost:54310/hbase</value>
   <description>The directory shared by region servers.
   Should be fully-qualified to include the filesystem to use.
   E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
   </description>
 </property>

 with fs.default.name set to hdfs://localhost:54310.
 Now I can see the hbase database in the Hadoop site manager (in the hbase
 directory, not the tmp directory, in HDFS).
 However, when I restart my computer, I must restart Hadoop (with the
 commands ./bin/hadoop namenode -format and ./bin/start-all.sh), restart
 HBase, and my database is lost. What can I do to save my database?

   
You don't need to format the namenode every time.. just
 bin/start-all.sh
   
   
   

 2009/8/21 Amandeep Khurana ama...@gmail.com

  On Thu, Aug 20, 2009 at 11:46 PM, Nguyen Thi Ngoc Huong 
  huongn...@gmail.com
   wrote:
 
  How can I configure the location of the hbase directory? I configured
  hbase-site.xml as follows:
  
  <property>
    <name>hbase.rootdir</name>
    <value>file:///temp/hbase-${user.name}/hbase</value>
    <description>The directory shared by region servers.
    Should be fully-qualified to include the filesystem to use.
    E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
    </description>
  </property>
  
 
 That's the trouble.. your data is being stored in the temp directory;
 instead, store it in your HDFS.
 So the value of the above property would be something like
 *hdfs://namenodeserver:port/hbase*
 
 
 
  
  and the log file is
  Not starting HMaster because:
  java.io.IOException: Mkdirs failed to create file:/temp/hbase-huongntn/hbase
          at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
          at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
          at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
          at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
          at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
          at org.apache.hadoop.hbase.util.FSUtils.setVersion(FSUtils.java:141)
          at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:210)
          at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:156)
          at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:96)
          at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:78)
          at

Re: Doubt in HBase

2009-08-21 Thread bharath vissapragada
JG,

In one of your replies above, you said that data locality was not
considered in older versions of HBase. Has there been any development on
this in 0.20 RC1/2 or 0.19.x? If not, can you tell me where that patch is
available so that I can test my programs?

Thanks in advance

On Sat, Aug 22, 2009 at 12:12 AM, Jonathan Gray jl...@streamy.com wrote:

 I really couldn't be specific.

 The more data that has to be moved across the wire, the more network i/o.

 For example, if you have very large values and a very large table, and you
 use that as the input to your MR job, you could potentially be network i/o
 bound.

 It should be very easy to test how your own jobs run on your own cluster
 using Ganglia and hadoop/mr logging/output.


 bharath vissapragada wrote:

 JG

 Can you please elaborate on the last statement by giving an example or
 some kind of scenario in which it can happen when MR jobs involve huge
 amounts of data?

 Thanks.

 On Fri, Aug 21, 2009 at 11:24 PM, Jonathan Gray jl...@streamy.com
 wrote:

  Ryan,

 In older versions of HBase, when we did not attempt any data locality, we
 had a few users running jobs that became network i/o bound.  It wasn't a
 latency issue, it was a bandwidth issue.

 That's actually when/why an attempt at better data locality for HBase MR
 was made in the first place.

 I hadn't personally experienced it but I recall two users who had. After
 they made a first-stab patch, I ran some comparisons and noticed a
 significant reduction in network i/o for data-intensive MR jobs.  They
 also
 were no longer network i/o bound on their jobs, if I recall, and became
 disk
 i/o bound (as one would expect/hope).

 For a majority of use cases, it doesn't matter in a significant way at
 all.
  But I have seen it make a measurable difference for some.

 JG


 bharath vissapragada wrote:

  Thanks Ryan

 I was just explaining with an example.. I have TBs of data to work with.
 I just wanted to know whether the scheduler TRIES to assign the reduce
 phase so as to keep the data local (i.e., TRYING to assign it to the
 machine with the greater number of key values).

 Thanks for your reply (following you on Twitter :))

 On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com
 wrote:

  hey,

 Yes, the Hadoop system attempts to assign map tasks data-locally, but
 why would you be worried about this for 5 values?  The max value size
 in HBase is Integer.MAX_VALUE, so it's not like you have much data to
 shuffle. Once your blobs are > ~64MB or so, it might make more sense to
 use HDFS directly and keep only the metadata in HBase (including
 things like the location of the data blob).
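
 A minimal sketch of that pattern (blob bytes in HDFS, only the pointer and
 a little metadata in the HBase row); the path, table name, family, and
 qualifier names are placeholders, not anything prescribed:

 import org.apache.hadoop.fs.FSDataOutputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.client.HTable;
 import org.apache.hadoop.hbase.client.Put;
 import org.apache.hadoop.hbase.util.Bytes;

 public class BlobStore {
   public static void storeBlob(String rowKey, byte[] blob) throws Exception {
     HBaseConfiguration conf = new HBaseConfiguration();

     // 1. Write the large blob straight into HDFS
     FileSystem fs = FileSystem.get(conf);
     Path blobPath = new Path("/blobs/" + rowKey);
     FSDataOutputStream out = fs.create(blobPath);
     out.write(blob);
     out.close();

     // 2. Keep only the metadata (location, size) in the HBase row
     HTable table = new HTable(conf, "records");
     Put put = new Put(Bytes.toBytes(rowKey));
     put.add(Bytes.toBytes("meta"), Bytes.toBytes("blobPath"), Bytes.toBytes(blobPath.toString()));
     put.add(Bytes.toBytes("meta"), Bytes.toBytes("blobSize"), Bytes.toBytes((long) blob.length));
     table.put(put);
   }
 }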

 I think people are confused about how optimal map reduces have to be.
 Keeping all the data super-local on each machine is not always helping
 you, since you have to read via a socket anyways. Going remote doesn't
 actually make things that much slower, since on a modern LAN ping
 times are < 0.1ms.  If your entire cluster is hanging off a single
 switch, there is nearly unlimited bandwidth between all nodes
 (certainly much higher than any single system could push).  Only once
 you go multi-switch then switch-locality (aka rack locality) becomes
 important.

 Remember, hadoop isn't about the instantaneous speed of any job, but
 about running jobs in a highly scalable manner that works on tens or
 tens of thousands of nodes. You end up blocking on single machine
 limits anyways, and the r=3 of HDFS helps you transcend a single
 machine read speed for large files. Keeping the data transfer local in
 this case results in lower performance.

 If you want max local speed, I suggest looking at CUDA.


 On Thu, Aug 20, 2009 at 9:09 PM, bharath
 vissapragadabharathvissapragada1...@gmail.com wrote:

  Aamandeep , Gray and Purtell thanks for your replies .. I have found
 them
 very useful.

 You said to increase the number of reduce tasks . Suppose the number
 of
 reduce tasks is more than number of distinct map output keys , some of

  the

  reduce processes may go waste ? is that the case?

 Also  I have one more doubt ..I have 5 values for a corresponding key
 on

  one

  region  and other 2 values on 2 different region servers.
 Does hadoop Map reduce take care of moving these 2 diff values to the

  region

  with 5 values instead of moving those 5 values to other system to

  minimize

  the dataflow? Is this what is happening inside ?

 On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org

  wrote:

  The behavior of TableInputFormat is to schedule one mapper for every
 table
 region.

 In addition to what others have said already, if your reducer is
 doing
 little more than storing data back into HBase (via
 TableOutputFormat),

  then
 you can consider writing results back to HBase directly from the
 mapper
 to
 avoid incurring the overhead of sort/shuffle/merge which happens
 within
 the
 Hadoop job framework as map outputs are