Re: HBase mention in VLDB keynote

2009-08-26 Thread Andrew Purtell
Right, the point I was making is not about absolute numbers but the scale of 
the test and the successful results at that scale. I would think that is on 
par with the (failed) experimentation at Yahoo, but I have yet to see the 
evaluation materials posted anywhere.

   - Andy






From: Jonathan Gray jl...@streamy.com
To: hbase-user@hadoop.apache.org
Sent: Tuesday, August 25, 2009 11:08:17 PM
Subject: Re: HBase mention in VLDB keynote

If you are just looking for numbers, they can vary quite drastically 
depending on the cluster configuration, cluster hardware, jvm/gc 
configuration, dataset properties, read patterns, and load patterns. 
The ones I provided in that presentation are from a very small cluster, 
with simple data and low load - my attempt at getting some baseline numbers.

You really need to load up some of your own data and see how it behaves 
on your own cluster.  And tuning is increasingly important now as we are 
limited by Java GC quite a bit.
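
For reference, here is a minimal sketch of what "loading up some of your
own data" can look like with the 0.20 client API. The table name "testtable"
and family "f1" are assumptions for this sketch (create them via the shell
first); error handling and tuning are omitted.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class LoadSample {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath.
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "testtable");
    for (int i = 0; i < 1000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("f1"), Bytes.toBytes("q1"),
          Bytes.toBytes("value-" + i));
      table.put(put);
    }
    table.flushCommits(); // flush any buffered writes
  }
}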

JG

Schubert Zhang wrote:
 @stack
 We know HIVE-705, and already have good communication with the contributor,
 since we are all Chinese. :-)
 In fact, some code from the patch is used and tested in our project. But we
 need a more flexible data store schema to solve engineering problems,
 especially around performance and practicality.
 
 @andy
 Do Ryan's results differ from JG's?
 On Wed, Aug 26, 2009 at 2:50 AM, Andrew Purtell apurt...@apache.org wrote:
 
 Hi Schubert,


 Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
 contradiction": can you provide more references, such as a URL/link to these
 contradicting results?

 For JG: http://www.docstoc.com/docs/7493304/HBase-Goes-Realtime

 I'm sure you have seen this already.

 Ryan has posted some information on the list now and again.

 Also, I think your performance evaluation work provides very important
 feedback and data points. Thanks for that.

 We are doing an interesting thing: making Hive able to use HBase as its
 data store. Now we can use Hive's SQL to query/mapreduce data stored in
 HBase, and we can also directly query/scan data from HBase.

 That sounds REALLY interesting!

   - Andy




 
 From: Schubert Zhang zson...@gmail.com
 To: hbase-user@hadoop.apache.org
 Sent: Tuesday, August 25, 2009 8:26:50 PM
  Subject: Re: HBase mention in VLDB keynote

 hi andy,

 Even though the current HBase is not yet ready for production, we know its
 data model and architecture are really testable and evaluable.

 Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
 contradiction": can you provide more references, such as a URL/link to
 these contradicting results?

 Regarding Hive, it's really a good design, especially its abstraction of a
 MapReduce workflow matched to SQL. Hive has been a real success inside
 Facebook; the report says 29% of Facebook employees use Hive, and 51% of
 those users are from outside engineering. That is probably because SQL is
 easier to learn than other languages such as Pig Latin. In fact, Pig is now
 adding metadata and SQL features, which Hive already provides. But Hive is
 still not very flexible about using a data store other than HDFS files. We
 are doing an interesting thing: making Hive able to use HBase as its data
 store. Now we can use Hive's SQL to query/mapreduce data stored in HBase,
 and we can also directly query/scan data from HBase.

 I believe HBase can be a data store that works as a storage adapter layer
 above HDFS. It is not a database; it is just a data storage adapter system
 above HDFS, with a distributed B-tree clustered index. Bigtable was designed
 to provide easier ways to store small data objects with random access, since
 GFS was designed for sequential-access/batch-processing/large-data storage
 and is not appropriate for storing small data objects or for random access.

 I also believe HBase can be a data store that makes MapReduce over HBase
 possible. If we review the Bigtable paper, especially section 8, we find
 that Bigtable is widely used for MapReduce analysis/summaries in many
 Google applications.


 In the recent ACM Queue interview with Sean Quinlan, the Google GFS lead,
 we can see that Google's new GFS has integrated some of Bigtable's data
 model: http://queue.acm.org/detail.cfm?id=1594206


 Schubert

 On Wed, Aug 26, 2009 at 12:36 AM, Bradford Stephens 
 bradfordsteph...@gmail.com wrote:

 Interesting. I need to see what sort of eval was going on for that
 presentation...

 He probably forgot to tweak GC :)

 On Tue, Aug 25, 2009 at 9:32 AM, Andrew Purtell apurt...@apache.org
 wrote:

 Can we write to him to find out more about how the evaluation was done?

 This was one interaction with that group, maybe the only other one
 aside
 from a question about sizing memstore:
 http://osdir.com/ml/hbase-user-hadoop-apache/2009-07/msg00552.html
 Now I wonder if the eval was done via the REST gateway... A followup
 might
 be useful. If I run into someone 

Re: hbase/jython outdated

2009-08-26 Thread Andrei Savu
I have fixed the code samples and opened a feature request on JIRA for
the jython command.

https://issues.apache.org/jira/browse/HBASE-1796

Until recently I used the Python Thrift interface, but it has some
serious issues with Unicode.
I am currently searching for alternatives.

Is there a Python library for the REST interface? How stable is the REST
interface?

On Tue, Aug 25, 2009 at 4:18 PM, Jean-Daniel Cryansjdcry...@apache.org wrote:
 I can edit this page just fine, but you have to be logged in to do
 that; anyone can sign in.

 Thx!

 J-D

 On Tue, Aug 25, 2009 at 7:02 AM, Andrei Savusavu.and...@gmail.com wrote:
 Hi,

 The Hbase/Jython wiki page ( http://wiki.apache.org/hadoop/Hbase/Jython )
 is outdated.
 I wanted to edit it, but the page is marked as immutable.

 I have attached a working sample and a patched version of bin/hbase
 with the jython command added.

 --
 Savu Andrei

 Website: http://www.andreisavu.ro/





-- 
Savu Andrei

Website: http://www.andreisavu.ro/


Settings

2009-08-26 Thread Lars George

Hi,

Over the years I have tried various settings in both Hadoop and 
HBase, and when redoing a cluster it is always a question whether we should 
keep a given setting or not - since the issue it suppressed may have been 
fixed already. Maybe we should have a wiki page with the current settings, 
the more advanced ones, and when and how to use them. I often find that the 
descriptions in the various default files are as ambiguous as the setting 
keys themselves.


Here is a list of the not-so-obvious settings and what I set them to - 
please help me identify which are still useful and which are actually 
obsolete.


HBase:
-

- fs.default.name = hdfs://master-hostname:9000/

This is usually in core-site.xml in Hadoop. Does the client or server 
need this key at all? Did I copy it into the hbase-site.xml file by mistake?


- hbase.cluster.distributed = true

For truly distributed operation and standalone ZK installations.

- dfs.datanode.socket.write.timeout = 0

This is used in the DataNode but, more importantly here, in the DFSClient. 
Its default is apparently fixed at 8 minutes; no default file (I would have 
assumed hdfs-default.xml) lists it.

We set it to 0 to avoid the socket timing out under low use etc., because 
the DFSClient reconnect is not handled gracefully. I trust setting it to 
0 is still what we recommend for HBase and is still valid?


- hbase.regionserver.lease.period = 60

The default was changed from 60 to 120 seconds. Over time I had issues and 
have set it to 10 minutes. Good or bad?


- hbase.hregion.memstore.block.multiplier = 4

This is up from the default 2. Good or bad?

- hbase.hregion.max.filesize = 536870912

Again twice as much as the default. Opinions?

- hbase.regions.nobalancing.count = 20

This seems to be missing from hbase-default.xml but is set to 4 in 
the code if not specified. The value above I got from Ryan to improve the 
startup of HBase. It means that while an RS is still opening regions (up 
to 20) it can already start rebalancing regions. Handled by the 
ServerManager during message processing. Opinions?


- hbase.regions.percheckin = 20

This is the count of regions assigned in one go. Handled in the 
RegionManager; the default is 10. Here we tell it to assign regions 
in larger batches to speed up cluster startup. Opinions?


- hbase.regionserver.handler.count = 30

Up from 10, as I often had the problem that the UI was not responsive 
while an import MR job was running - all handlers were busy doing the 
inserts. JD mentioned it may be set to a higher default value?



Hadoop:
--

- dfs.block.size = 134217728

Up from the default 64MB. I have done this in the past as my data size 
per cell is larger than the usual few bytes - I can have a few KB up to 
just above 1 MB per value. Does this still make sense?


- dfs.namenode.handler.count = 20

This was upped from the default 10 quite some time ago (more than a year 
ago). Is this still required?


- dfs.datanode.socket.write.timeout = 0

This is the matching entry to the above I suppose. This time for the 
DataNode. Still required?


- dfs.datanode.max.xcievers = 4096

The default is 256 and often way too low. What is a good value to use? 
What is the drawback of setting it high?



Thanks,
Lars



Re: Hbase 0.20 example\manual

2009-08-26 Thread Alex Spodinets
Gents,

It appears that the mapred example in the HBase 0.20 RC1 source uses a lot
of deprecated classes. Is it safe to assume that it is out of date? If so,
could anyone point me to a MapReduce example for 0.20?

Thanks,
Alex

On Tue, Aug 18, 2009 at 2:26 AM, Alex Spodinets spodin...@gmail.com wrote:

 exciting, thanks.


 On Tue, Aug 18, 2009 at 2:05 AM, Jonathan Gray jl...@streamy.com wrote:

 Look at the overview/summary in the javadocs.

 I'm not sure if an official one has been posted yet, but you can check out
 the Getting Started guide here:

 http://jgray.la/javadoc/hbase-0.20.0/overview-summary.html

 And API examples here:


 http://jgray.la/javadoc/hbase-0.20.0/org/apache/hadoop/hbase/client/package-summary.html

 JG


 Alex Spodinets wrote:

 Hello,

 Could someone kindly point me to an example of HBase 0.20 API usage. All I
 was able to find so far is a Map\Reduce example in the 0.20 SVN source.
 It would also be good to have some info on how 0.20 should be installed,
 especially ZooKeeper.

 Thanks.





Re: Hbase 0.20 example\manual

2009-08-26 Thread stack
See under http://people.apache.org/~stack/hbase-0.20.0-candidate-2/docs/.
The client code is linked from the 'Getting Started' section.  Here is
direct link: http://su.pr/Anqe9D
St.Ack

On Wed, Aug 26, 2009 at 9:10 AM, Alex Spodinets spodin...@gmail.com wrote:

 Gents,

 It appears that the mapred example in the HBase 0.20 RC1 source uses a lot
 of deprecated classes. Is it safe to assume that it is out of date? If so,
 could anyone point me to a MapReduce example for 0.20?

 Thanks,
 Alex

 On Tue, Aug 18, 2009 at 2:26 AM, Alex Spodinets spodin...@gmail.com
 wrote:

  exciting, thanks.

  On Tue, Aug 18, 2009 at 2:05 AM, Jonathan Gray jl...@streamy.com
  wrote:

   Look at the overview/summary in the javadocs.

   I'm not sure if an official one has been posted yet, but you can check
   out the Getting Started guide here:

   http://jgray.la/javadoc/hbase-0.20.0/overview-summary.html

   And API examples here:

   http://jgray.la/javadoc/hbase-0.20.0/org/apache/hadoop/hbase/client/package-summary.html

   JG

   Alex Spodinets wrote:

    Hello,

    Could someone kindly point me to an example of HBase 0.20 API usage. All I
    was able to find so far is a Map\Reduce example in the 0.20 SVN source.
    It would also be good to have some info on how 0.20 should be installed,
    especially ZooKeeper.

    Thanks.
 



Re: Hbase 0.20 example\manual

2009-08-26 Thread bharath vissapragada
Hi,

I saw the tableindexed package here

http://people.apache.org/~stack/hbase-0.20.0-candidate-1/docs/api/org/apache/hadoop/hbase/client/tableindexed/package-summary.html

I have a question.

Suppose I have the following table:

rowkey  col
1       a
2       a
3       b
4       a
5       c
6       b

Suppose I want to index on col, so my secondary index should be
somewhat as follows:

key  val(s)
a    1,2,4
b    3,6
c    5

Does the new tableindexed package allow this kind of repetition in the
column being indexed?
Some people have said that in 0.19.x the value of the column on which we
are indexing always had to be distinct.

Does 0.20 add any support for cases like the above?
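
For what it's worth, one common way to get such an index - sketched below
as a general technique, not a claim about the tableindexed internals - is
to build the index row key from the indexed value plus the base row key,
so rows sharing a column value still get unique index entries:

import org.apache.hadoop.hbase.util.Bytes;

public class IndexKeySketch {
  // Index key = indexed column value + base row key.
  public static byte[] indexKey(String columnValue, String baseRowKey) {
    return Bytes.add(Bytes.toBytes(columnValue), Bytes.toBytes(baseRowKey));
  }

  public static void main(String[] args) {
    // Value "a" appears in base rows 1, 2 and 4, yet all three
    // index keys stay unique: "a1", "a2", "a4".
    System.out.println(Bytes.toString(indexKey("a", "1")));
    System.out.println(Bytes.toString(indexKey("a", "2")));
    System.out.println(Bytes.toString(indexKey("a", "4")));
  }
}

A prefix scan over "a" in the index table then yields every base row whose
column value is "a".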



On Tue, Aug 18, 2009 at 4:36 AM, stack st...@duboce.net wrote:

 Does this help?


 http://people.apache.org/~stack/hbase-0.20.0-candidate-1/docs/api/overview-summary.html#overview_description

 Includes sample client usage and all about zk + hbase.

 St.Ack


 On Mon, Aug 17, 2009 at 3:57 PM, Alex Spodinets spodin...@gmail.com
 wrote:

  Hello,
 
  Could someone kindly point me to an example of HBase 0.20 API usage. All I
  was able to find so far is a Map\Reduce example in the 0.20 SVN source.
  It would also be good to have some info on how 0.20 should be installed,
  especially ZooKeeper.
 
  Thanks.
 



Re: Hbase 0.20 example\manual

2009-08-26 Thread Alex Spodinets
St.Ack,

That is a client example. I'm hoping to get a Map\Reduce example - do you
have one handy?

Thanks,
Alex

On Wed, Aug 26, 2009 at 7:27 PM, stack st...@duboce.net wrote:

 See under http://people.apache.org/~stack/hbase-0.20.0-candidate-2/docs/.
 The client code is linked from the 'Getting Started' section.  Here is
 direct link: http://su.pr/Anqe9D
 St.Ack

 On Wed, Aug 26, 2009 at 9:10 AM, Alex Spodinets spodin...@gmail.com
 wrote:

  Gents,
 
  It appears that the mapred example in the HBase 0.20 RC1 source uses a
  lot of deprecated classes. Is it safe to assume that it is out of date?
  If so, could anyone point me to a MapReduce example for 0.20?
 
  Thanks,
  Alex
 
  On Tue, Aug 18, 2009 at 2:26 AM, Alex Spodinets spodin...@gmail.com
  wrote:

   exciting, thanks.

   On Tue, Aug 18, 2009 at 2:05 AM, Jonathan Gray jl...@streamy.com
   wrote:

    Look at the overview/summary in the javadocs.

    I'm not sure if an official one has been posted yet, but you can check
    out the Getting Started guide here:

    http://jgray.la/javadoc/hbase-0.20.0/overview-summary.html

    And API examples here:

    http://jgray.la/javadoc/hbase-0.20.0/org/apache/hadoop/hbase/client/package-summary.html

    JG

    Alex Spodinets wrote:

     Hello,

     Could someone kindly point me to an example of HBase 0.20 API usage. All I
     was able to find so far is a Map\Reduce example in the 0.20 SVN source.
     It would also be good to have some info on how 0.20 should be installed,
     especially ZooKeeper.

     Thanks.



Re: Hbase 0.20 example\manual

2009-08-26 Thread Lars George

Alex,

Check the org.apache.hadoop.hbase.mapreduce package. It has the updated 
API and new classes. The legacy mapred package is deprecated. If you would 
like to see an example, check out the RowCounter class.
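
For reference, here is a minimal sketch in the spirit of RowCounter, using
the new org.apache.hadoop.hbase.mapreduce API (the table name "mytable" is
an assumption; RowCounter itself remains the authoritative example):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SimpleRowCounter {

  static class RowCounterMapper
      extends TableMapper<ImmutableBytesWritable, Result> {
    public static enum Counters { ROWS }

    @Override
    protected void map(ImmutableBytesWritable row, Result values,
        Context context) {
      // One increment per row handed to the mapper.
      context.getCounter(Counters.ROWS).increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "simplerowcounter");
    job.setJarByClass(SimpleRowCounter.class);
    Scan scan = new Scan(); // full-table scan
    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class,
        job);
    job.setNumReduceTasks(0); // map-only; the counter is the result
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}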


Lars


Alex Spodinets wrote:

 St.Ack,

 That is a client example. I'm hoping to get a Map\Reduce example - do you
 have one handy?

 Thanks,
 Alex

 On Wed, Aug 26, 2009 at 7:27 PM, stack st...@duboce.net wrote:

  See under http://people.apache.org/~stack/hbase-0.20.0-candidate-2/docs/.
  The client code is linked from the 'Getting Started' section.  Here is
  direct link: http://su.pr/Anqe9D
  St.Ack

  On Wed, Aug 26, 2009 at 9:10 AM, Alex Spodinets spodin...@gmail.com
  wrote:

   Gents,

   It appears that the mapred example in the HBase 0.20 RC1 source uses a
   lot of deprecated classes. Is it safe to assume that it is out of date?
   If so, could anyone point me to a MapReduce example for 0.20?

   Thanks,
   Alex

Re: HBase mention in VLDB keynote

2009-08-26 Thread stack
On Tue, Aug 25, 2009 at 7:05 PM, Schubert Zhang zson...@gmail.com wrote:

 Thanks JG. We are trying to load up our datasets now.  But one thing is
 for sure: the cluster becomes slower as the dataset grows larger and
 larger. This is distinct on writes and random reads.


What kind of sizes are you talking about, Schubert, and can you figure out
where the slowdown is?
St.Ack


Re: Hbase 0.20 example\manual

2009-08-26 Thread stack
On Wed, Aug 26, 2009 at 9:35 AM, Alex Spodinets spodin...@gmail.com wrote:

 St.Ack,

 That is a client example. I'm hoping to get a Map\Reduce example - do you
 have one handy?


Sorry about that.  Yeah, what Lars said (I just committed a patch that
clears out the old example with deprecated code and instead points you to
RowCounter as an example of how to use the new API).
St.Ack


Re: Hbase 0.20 example\manual

2009-08-26 Thread Alex Spodinets
Got it, Thanks.

On Wed, Aug 26, 2009 at 9:16 PM, stack st...@duboce.net wrote:

 On Wed, Aug 26, 2009 at 9:35 AM, Alex Spodinets spodin...@gmail.com
 wrote:

  St.Ack,
 
  That is a client example. I'm hoping to get a Map\Reduce example - do you
  have one handy?
 

 Sorry about that.  Yeah, what Lars said (I just committed a patch that
 clears out the old example with deprecated code and instead points you to
 RowCounter as an example of how to use the new API).
 St.Ack



Will ROOT region be a bottleneck?

2009-08-26 Thread y_823910
Hi,
The HBaseMaster is responsible for assigning regions to HRegionServers.
The first region to be assigned is the ROOT region.
The ROOT region is served by a region server, right?
Will it be a bottleneck when many clients make requests at the same time?
Thanks

Fleming





Re: Will ROOT region be a bottleneck?

2009-08-26 Thread Ryan Rawson
While it seems like ROOT might be a bottleneck, with aggressive client
caching it ends up not being an issue.  Clients cache the location of
ROOT, then they cache the location of META and the locations of the
user-table regions.  All is well.
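
A minimal client sketch of what that looks like (the table "mytable" and
the row keys are assumptions; error handling omitted): the first get() pays
for the ROOT and META lookups, and later gets reuse the cached locations.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CachedLookups {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");
    // First get: the client walks ROOT -> META -> user region and
    // caches the region locations along the way.
    Result first = table.get(new Get(Bytes.toBytes("row1")));
    // Later gets go straight to the region server holding the row;
    // ROOT and META are consulted again only if the cache goes stale.
    Result second = table.get(new Get(Bytes.toBytes("row2")));
    System.out.println(first + " / " + second);
  }
}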

-ryan

On Wed, Aug 26, 2009 at 5:43 PM, y_823...@tsmc.com wrote:
 Hi,
 The HBaseMaster is responsible for assigning regions to HRegionServers.
 The first region to be assigned is the ROOT region.
 The ROOT region is served by a region server, right?
 Will it be a bottleneck? While many clients request at the same time.
 Thanks

 Fleming






Re: Seattle / NW Hadoop, HBase Lucene, etc. Meetup , Wed August 26th, 6:45pm

2009-08-26 Thread Bradford Stephens

Hello,

My apologies, but there was a mix-up reserving our meeting location,  
and we don't have access to it.


I'm very sorry, and beer is on me next month. Promise :)

Sent from my Internets

On Aug 25, 2009, at 4:21 PM, Bradford Stephens bradfordsteph...@gmail.com 
 wrote:



Hey there,

Apologies for this not going out sooner -- apparently it was sitting
as a draft in my inbox. A few of you have pinged me, so thanks for
your vigilance.

It's time for another Hadoop/Lucene/Apache Stack meetup! We've had
great attendance in the past few months, let's keep it up! I'm always
amazed by the things I learn from everyone.

We're back at the University of Washington, Allen Computer Science
Center (not Computer Engineering)
Map: http://www.washington.edu/home/maps/?CSE

Room: 303 -or- the Entry level. If there are changes, signs will be  
posted.


More Info:

The meetup is about 2 hours: we'll have two in-depth talks of 15-20
minutes each, and then several lightning talks of 5 minutes. If no one
offers talks, we'll just have general discussion and 'social time'. Let me
know if you're interested in speaking or attending. We'd like to focus on
education, so every presentation *needs* to ask some questions at the end.
We can talk about these after the presentations, and I'll record what
we've learned in a wiki and share it with the rest of us.

Contact: Bradford Stephens, 904-415-3009, bradfordsteph...@gmail.com

--
http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science


Re: Settings

2009-08-26 Thread Schubert Zhang

  HBase:
 -
 - fs.default.name = hdfs://master-hostname:9000/

 This is usually in core-site.xml in Hadoop. Is the client or server needing
 this key at all? Did I copy it in the hbase site file by mistake?


 [schubert] I think it's better not to copy it into the HBase conf file. I
suggest you modify your hbase-env.sh to add the Hadoop conf path to your
HBASE_CLASSPATH, e.g. export
HBASE_CLASSPATH=${HBASE_HOME}/../hadoop-0.20.0/conf.
Besides that, we should also configure the GC options there.
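For example (only a sketch - the exact flags are workload-dependent, and
the concurrent collector is just the commonly suggested starting point):
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"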



 - hbase.cluster.distributed = true

 For true replication and stand alone ZK installations.


[schubert] You should also export HBASE_MANAGES_ZK=false in hbase-env.sh to
be consistent.




 - dfs.datanode.socket.write.timeout = 0


[schubert] This parameter is for Hadoop/HDFS; it belongs in
hadoop-0.20.0/conf/hdfs-site.xml. But I think it should not be needed now.




 This is used in DataNode but here more importantly in DFSClient. Its
 default is fixed to apparently 8 minutes, no default file (I would have
 assumed hdfs-default.xml) has it listed.

 We set it to 0 to avoid the socket timing out on low use etc. because the
 DFSClient reconnect is not handled gracefully. I trust setting it to 0 is
 what we recommend for HBase and is still valid?

 - hbase.regionserver.lease.period = 60

 Default was changed from 60 to 120 seconds. Over time I had issues and have
 set it to 10mins. Good or bad?


[schubert] I think if you select the right JVM GC options, the default is
OK.




 - hbase.hregion.memstore.block.multiplier = 4

 This is up from the default 2. Good or bad?


[schubert] I do not think it is necessary; can you describe your reason?




 - hbase.hregion.max.filesize = 536870912

 Again twice as much as the default. Opinions?


[schubert] If you want a bigger region size, I think it's fine. We have
even tried 1GB in some tests.




 - hbase.regions.nobalancing.count = 20

 This seems to be missing from the hbase-default.xml but is set to 4 in the
 code if not specified. The above I got from Ryan to improve startup of
 HBase. It means that while a RS is still opening up to 20 regions it can
 start rebalance regions. Handled by the ServerManager during message
 processing. Opinions?


[schubert] I think it makes sense.




 - hbase.regions.percheckin = 20

 This is the count of regions assigned in one go. Handled in RegionmManager
 and the default is 10. Here we tell it to assign regions in larger batches
 to speed up the cluster start. Opinions?


 [schubert] I have no idea about it. I think region assignment will incur
some CPU and memory overhead on the region server if there are too many
HLogs to be processed.




 - hbase.regionserver.handler.count = 30

 Up from 10 as I had often the problem that the UI was not responsive while
 a import MR job would run. All handlers were busy doing the inserts. JD
 mentioned it may be set to a higher default value?


[schubert] It makes sense. In my small 5-node cluster, I set it to 20.



 Hadoop:
 --

 - dfs.block.size = 134217728

 Up from the default 64MB. I have done this in the past as my data size per
 cell is larger than the usual few bytes. I can have a few KB up to just
 above 1 MB per value. Still making sense?



[schubert] I think your reasoning makes sense.



 - dfs.namenode.handler.count = 20

 This was upped from the default 10 quite some time ago (more than a year
 ago). So is this still required?


[schubert] I also set it to 20.




 - dfs.datanode.socket.write.timeout = 0

 This is the matching entry to the above I suppose. This time for the
 DataNode. Still required?


[schubert] I think it is no longer necessary.




 - dfs.datanode.max.xcievers = 4096

 Default is 256 and often way to low. What is a good value you would use?
 What is the drawback setting it high?


[schubert] It makes sense. I use 3072 in my small cluster.






 Thanks,
 Lars



Re: Settings

2009-08-26 Thread stack
On Wed, Aug 26, 2009 at 7:40 AM, Lars George l...@worldlingo.com wrote:

 Hi,

 It seems over the years I tried various settings in both Hadoop and HBase
 and when redoing a cluster it is always a question if we should keep that
 setting or not - since the issue it suppressed was fixed already. Maybe we
 should have a wiki page with the current settings and more advanced ones and
 when and how to use them. I find often that the description itself in the
 various default files are often as ambiguous as the setting key itself.



I'd rather fix the descriptions so they're clear than add extra info out
in a wiki; wiki pages tend to rot.



- fs.default.name = hdfs://master-hostname:9000/

 This is usually in core-site.xml in Hadoop. Is the client or server needing
 this key at all? Did I copy it in the hbase site file by mistake?



There probably was a reason long ago but, yeah, you shouldn't need this (as
Schubert says).



 - hbase.cluster.distributed = true

 For true replication and stand alone ZK installations.

 - dfs.datanode.socket.write.timeout = 0

 This is used in DataNode but here more importantly in DFSClient. Its
 default is fixed to apparently 8 minutes, no default file (I would have
 assumed hdfs-default.xml) has it listed.

 We set it to 0 to avoid the socket timing out on low use etc. because the
 DFSClient reconnect is not handled gracefully. I trust setting it to 0 is
 what we recommend for HBase and is still valid?



For background on this, see
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#6.  It shouldn't be
needed anymore, especially with HADOOP-4681 in place, but IIRC apurtell
once had trouble bringing up a cluster when it shouldn't have been needed,
and the only way to get it up was to set this to zero.  We should test.
BTW, this is a client-side config.  You have it below under Hadoop; it
shouldn't be needed there, not by HBase at least (maybe you have other
HDFS clients that had this issue?).




 - hbase.regionserver.lease.period = 60

 Default was changed from 60 to 120 seconds. Over time I had issues and have
 set it to 10mins. Good or bad?



There is an open issue to check whether this is even used anymore. The
lease is in ZK now; I don't think this has any effect anymore.



 - hbase.hregion.memstore.block.multiplier = 4

 This is up from the default 2. Good or bad?



It means that we'll fill more RAM before we bring down the writes gate - up
to the multiplier times the flush size (so if 64MB is the default flush
size, a multiplier of 2 keeps taking on writes until we get to 2x64MB).
2x is good for the 64MB default I'd say - especially during a virulent
upload with lots of Stores.
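
To make that concrete with the numbers from this thread: with the 64MB
default flush size, Lars's multiplier of 4 keeps accepting writes until a
region's memstore reaches 4 x 64MB = 256MB, versus 128MB with the default
multiplier of 2.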




 - hbase.hregion.max.filesize = 536870912

 Again twice as much as the default. Opinions?


It means you should have fewer regions overall, for perhaps some small
compromise in performance (TBD).  I think that in 0.21 we'll likely up the
default region size to this or larger.  Needs testing.  I'd say leave it
if performance is OK for you and you have lots of regions.


 - hbase.regions.nobalancing.count = 20

 This seems to be missing from the hbase-default.xml but is set to 4 in the
 code if not specified. The above I got from Ryan to improve startup of
 HBase. It means that while a RS is still opening up to 20 regions it can
 start rebalance regions. Handled by the ServerManager during message
 processing. Opinions?



If it works for you, keep it.  This whole startup and region reassignment is
going to be redone in 0.21.  These configurations will likely change at that
time.




 - hbase.regions.percheckin = 20

 This is the count of regions assigned in one go. Handled in RegionmManager
 and the default is 10. Here we tell it to assign regions in larger batches
 to speed up the cluster start. Opinions?


See previous note.





 - hbase.regionserver.handler.count = 30

 Up from 10 as I had often the problem that the UI was not responsive while
 a import MR job would run. All handlers were busy doing the inserts. JD
 mentioned it may be set to a higher default value?


No harm here.  Do the math.  Is it likely that you'll have 30 clients
concurrently trying to get stuff out of a regionserver?  If so, keep it I'd
say.





 Hadoop:
 --

 - dfs.block.size = 134217728

 Up from the default 64MB. I have done this in the past as my data size per
 cell is larger than the usual few bytes. I can have a few KB up to just
 above 1 MB per value. Still making sense?



No opinion.  Whatever works for you.




 - dfs.namenode.handler.count = 20

 This was upped from the default 10 quite some time ago (more than a year
 ago). So is this still required?


Probably.  Check it during a time of high load.  Are all in use?




 - dfs.datanode.socket.write.timeout = 0

 This is the matching entry to the above I suppose. This time for the
 DataNode. Still required?


See comment near top.





 - dfs.datanode.max.xcievers = 4096

 Default is 256 and often way to low. What is a good value you would use?
 What is the drawback setting it