Re: HBase Performance Improvements?

2012-05-10 Thread Something Something
Thank you Tim & Bryan for the responses, and sorry for the delay; I got busy with other things. Bryan, I decided to focus on the region split problem first. The challenge here is to find the correct start key for each region, right? Here are the steps I could think of: 1) Sort the keys.
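
(A minimal sketch of the pre-split step under discussion, against the 0.90/0.92-era Java API; the table name, family, and key sample are hypothetical. The idea: from the sorted keys, take every (n/numRegions)-th key as a region start key and create the table pre-split.)

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class PreSplitTable {
      // sortedKeys is assumed to be the full set of row keys, already sorted.
      public static void createPreSplit(List<byte[]> sortedKeys, int numRegions)
          throws Exception {
        // Every (n/numRegions)-th key becomes a region start key; the first
        // region implicitly starts at the empty key, so index 0 is skipped.
        byte[][] splits = new byte[numRegions - 1][];
        int step = sortedKeys.size() / numRegions;
        for (int i = 1; i < numRegions; i++) {
          splits[i - 1] = sortedKeys.get(i * step);
        }
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("mytable");
        desc.addFamily(new HColumnDescriptor("cf"));
        admin.createTable(desc, splits); // pre-creates one region per split
      }
    }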

Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Eran Kutner
Hi, We're seeing occasional regionserver crashes during heavy write operations to HBase (at the reduce phase of large M/R jobs). I have increased the file descriptors, HDFS xceivers, and HDFS threads to the recommended settings, and in fact well above them. Here is an example of the HBase log (showing only

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Igal Shilman
Hi Eran, Do you have dfs.datanode.socket.write.timeout set in hdfs-site.xml? (We have set this to zero in our cluster, which means waiting as long as necessary for the write to complete.) Igal. On Thu, May 10, 2012 at 11:17 AM, Eran Kutner wrote: > Hi, > We're seeing occasional regionserver cr
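
(For reference, the property Igal mentions lives in hdfs-site.xml; a value of 0 means the writer waits forever, a choice Stack questions later in this thread.)

    <!-- hdfs-site.xml: 0 disables the datanode socket write timeout -->
    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>0</value>
    </property>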

Re: Looking for a single row - HTable.get(Get) or Scan(Get)

2012-05-10 Thread Igal Shilman
I think it is also worth mentioning that Gets can be much more I/O efficient than Scans if you have bloom filters enabled. On Thu, May 10, 2012 at 12:29 AM, Doug Meil wrote: > > Also, there is multi-Get support as of 0.90.x to further optimize the RPC > calls if you need to make a bunch of
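
(A sketch of what that looks like in the 0.92-era Java API: a ROW bloom filter on the family at creation time, then a point Get and a multi-Get; all names are hypothetical.)

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.regionserver.StoreFile;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BloomGets {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // A ROW bloom filter lets a Get skip HFiles that cannot contain the key.
        HColumnDescriptor cf = new HColumnDescriptor("cf");
        cf.setBloomFilterType(StoreFile.BloomType.ROW);
        HTableDescriptor desc = new HTableDescriptor("mytable");
        desc.addFamily(cf);
        admin.createTable(desc);

        HTable table = new HTable(conf, "mytable");
        Result one = table.get(new Get(Bytes.toBytes("row-1")));  // single Get
        Result[] many = table.get(Arrays.asList(                  // multi-Get, 0.90.x+
            new Get(Bytes.toBytes("row-2")), new Get(Bytes.toBytes("row-3"))));
      }
    }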

Re: HRegionPartitioner breaks my configuration

2012-05-10 Thread Christoph Bauer
Hi, thank you for your input. I've been doing it exactly that way. Jars appear in the classpath without problems, but I am unable to get hbase-site.xml onto the mapper's classpath. So now I will try adding hbase-site.xml to the classpath via hadoop-env.sh. The addDependencyJars mechanism does not work f
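
(A sketch of the usual alternative to copying hbase-site.xml around by hand: let HBaseConfiguration.create() fold the client-side hbase-site.xml into the job conf, and let TableMapReduceUtil ship the jars; 0.90/0.92-era API, table name hypothetical.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.HRegionPartitioner;
    import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.mapreduce.Job;

    public class PartitionedJob {
      public static void main(String[] args) throws Exception {
        // create() reads hbase-site.xml from the submitting JVM's classpath
        // and copies it into the job configuration, so tasks see it too.
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "write-to-hbase");
        job.setJarByClass(PartitionedJob.class);
        // Sets the output table, reducer, and partitioner, and calls
        // addDependencyJars() so the HBase jars travel with the job.
        TableMapReduceUtil.initTableReducerJob(
            "mytable", IdentityTableReducer.class, job, HRegionPartitioner.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }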

Re: HBase Performance Improvements?

2012-05-10 Thread Oliver Meyn (GBIF)
Heya Something, I've put my ImportAvro class up for your amusement. It's a maven project so you should be able to check it out, build the jar with dependencies, and then just run it. See the Readme for more details. http://code.google.com/p/gbif-labs/source/browse/import-avro/ For your tabl

Re: HRegionPartitioner breaks my configuration

2012-05-10 Thread Harsh J
Christoph, I still don't get your issue though. A stack trace of the error thrown out by the components you used would be good to have :) On Thu, May 10, 2012 at 3:38 PM, Christoph Bauer wrote: > Hi, > thank you for your input. > > I've been doing it exactly that way. Jars appear in the classpat

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Eran Kutner
Thanks Igal, but we already have that setting. These are the relevant settings from hdfs-site.xml: dfs.datanode.max.xcievers = 65536, dfs.datanode.handler.count = 10, dfs.datanode.socket.write.timeout = 0. Other ideas? -eran On Thu, May 10, 2012 at 12:25 PM, Ig

Re: HRegionPartitioner breaks my configuration

2012-05-10 Thread Christoph Bauer
2012/5/10 Harsh J : > Christoph, > > I still don't get your issue though. A stack trace of the error thrown > out by the components you used would be good to have :) ok. ;) $ export HBASE_CONF_DIR=/etc/hbase/conf/ $ export HADOOP_CLASSPATH=`hbase classpath` $ hadoop jar test.jar simple.HtablePartitio

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Michel Segel
Silly question... Why are you using a reducer when working with HBase? Second silly question... What is the max file size of the table that you are writing to? Third silly question... How many regions are on each of your region servers? Fourth silly question... There is this bandwidth setting...

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Eran Kutner
Hi Mike, Not sure I understand the question about the reducer. I'm using a reducer because my M/R jobs require one, and I want to write the result to HBase. I have two tables I'm writing to; one uses the default file size (256MB if I remember correctly), the other 512MB. There are ~700 re

Re: How to run two data nodes on one pc?

2012-05-10 Thread shashwat shriparv
Best option: install two Ubuntu VMs and use those :) On Thu, May 10, 2012 at 12:49 AM, waled tayiib wrote: > I'm trying to test my MapReduce program on more than one node, > and I only have one computer that has 2 processors. > > > > > From: Michael Segel > To:

Re: How to run two data nodes on one pc?

2012-05-10 Thread Harsh J
Waled, I've covered running multiple daemons on a single node (or so-called "expanded pseudo-distributed mode"? :)) briefly at http://search-hadoop.com/m/j2iwg1Re6gk with some configs added. On Thu, May 10, 2012 at 12:49 AM, waled tayiib wrote: > I'm trying to test my MapReduce program on more th
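
(If the goal is just testing M/R against more than one datanode on one box, another option is Hadoop's own test harness; a sketch against the 0.20/1.x-era hadoop-test jar.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;

    public class TwoDataNodes {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Two datanodes in one JVM: (conf, numDataNodes, format, racks).
        MiniDFSCluster cluster = new MiniDFSCluster(conf, 2, true, null);
        try {
          FileSystem fs = cluster.getFileSystem();
          fs.mkdirs(new Path("/test")); // exercise the mini cluster
          System.out.println("datanodes: " + cluster.getDataNodes().size());
        } finally {
          cluster.shutdown();
        }
      }
    }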

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Michael Segel
Ok... So the issue is that you have a lot of regions on a region server, where the max file size is the default. On your input to HBase, you have a couple of issues. 1) Your data is most likely sorted. (Not good for inserts.) 2) You will want to increase your region size from the default (256MB) to
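
(A sketch of the region-size change per-table, against the 0.90/0.92-era API; cluster-wide the equivalent is hbase.hregion.max.filesize in hbase-site.xml. The table name and the 2GB figure are illustrative.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RaiseRegionSize {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        byte[] name = Bytes.toBytes("mytable");
        HTableDescriptor desc = admin.getTableDescriptor(name);
        desc.setMaxFileSize(2L * 1024 * 1024 * 1024); // 2GB before a region splits
        admin.disableTable(name);
        admin.modifyTable(name, desc);
        admin.enableTable(name);
      }
    }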

Re: Looking for a single row - HTable.get(Get) or Scan(Get)

2012-05-10 Thread Bryan Beaudreault
I was under the impression that a single-row Scan can use the bloom filter as well. Can anyone verify/refute? On Thu, May 10, 2012 at 5:31 AM, Igal Shilman wrote: > I think it is also worth mentioning that Gets can be much more I/O > efficient than Scans if you have bloom filters enabled.

Re: HBase Performance Improvements?

2012-05-10 Thread Bryan Beaudreault
Since our key was ImmutableBytesWritable (representing a rowKey) and the value was KeyValue, there could be many KeyValues per row key (thus many values per Hadoop key in the reducer). So yes, what we did is very much the same as what you described. Hadoop will sort the ImmutableBytesWritable keys befor
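
(A sketch of that job shape against the 0.92-era API: map emits (ImmutableBytesWritable, KeyValue), and HFileOutputFormat.configureIncrementalLoad() installs the partitioner and sort so the output HFiles line up with the table's regions. The input format, family/qualifier, and paths are hypothetical.)

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadJob {
      // Turns "rowKey,value" lines into KeyValues keyed by row.
      static class LineMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
        protected void map(LongWritable off, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] f = line.toString().split(",", 2);
          byte[] row = Bytes.toBytes(f[0]);
          ctx.write(new ImmutableBytesWritable(row), new KeyValue(
              row, Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(f[1])));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "bulk-load");
        job.setJarByClass(BulkLoadJob.class);
        job.setMapperClass(LineMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Reads the table's region boundaries and configures the partitioner,
        // the sorting reducer, and HFileOutputFormat to match.
        HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "mytable"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }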

Re: Confusing questions ! Hadoop Beginner

2012-05-10 Thread yavuz gokirmak
Hi, I am not an expert but can give some ideas. (Correct me if I am wrong please :) ) Regardless of whether you use HBase or Hive, data is stored in HDFS at the end of the day. What Hive provides is an SQL interface over raw data. When you load data into Hive, you define its fields, columns and parsi

Re: Looking for a single row - HTable.get(Get) or Scan(Get)

2012-05-10 Thread Harsh J
Bryan, Yes, that is true. Essentially Scan.isGetScan() [0] needs to pass before bloom filter checks proceed. And that method passes for single row scans as expected. Multi-gets support bloom filter lookups as well (although is implemented in a different way). [0] - http://hbase.apache.org/apido

How to start up datanode with kerberos?

2012-05-10 Thread shixing
Hi all: Now I want to set up security for HBase with Kerberos. As I understand it, HBase's UGI is based on Hadoop's UserGroupInformation, without the "hadoop.job.ugi" parameter, after 0.20.2. So when I use CDH3u3, the UGI can be generated by two authentication methods: simple or Kerberos. Firstly
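
(A sketch of the client-side login path on a secured CDH3u3-era cluster; the principal and keytab path are hypothetical.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLogin {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Logs the process in from a keytab instead of the removed hadoop.job.ugi.
        UserGroupInformation.loginUserFromKeytab(
            "hbase/host.example.com@EXAMPLE.COM", "/etc/hbase/conf/hbase.keytab");
        System.out.println("logged in as: "
            + UserGroupInformation.getLoginUser().getUserName());
      }
    }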

Re: Confusing questions ! Hadoop Beginner

2012-05-10 Thread shashwat shriparv
Use HBase to store the data, then map the HBase tables into Hive to run MapReduce SQL queries; the data will be stored in HDFS either way. Two options: write MapReduce jobs to fetch data from HBase, or map the HBase tables to Hive (create external tables, where it wi
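
(A sketch in HiveQL of the external-table mapping being described; it requires the hive-hbase-handler jar, and all table/column names here are hypothetical.)

    -- Map an existing HBase table into Hive without copying the data.
    CREATE EXTERNAL TABLE hbase_mytable (rowkey STRING, val STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:q")
    TBLPROPERTIES ("hbase.table.name" = "mytable");

    -- Queries now run as MapReduce jobs over the HBase-backed data.
    SELECT val, COUNT(*) FROM hbase_mytable GROUP BY val;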

Re: HBase Performance Improvements?

2012-05-10 Thread Something Something
I am beginning to get a sinking feeling about this :( But I won't give up! The problem is that when I use one reducer the job runs for a long time; I killed it after about an hour. Keep in mind, we do have a decent cluster size. The Map stage completes in a minute, & when I set the no. of reducers to 0

Re: Switching existing table to Snappy possible?

2012-05-10 Thread Jeff Whiting
We really need to be able to do this type of thing online. Taking your table down just so you can change the compression/bloom/whatever isn't very cool for a production cluster. My $0.02 ~Jeff On 5/9/2012 10:10 PM, Harsh J wrote: Jiajun, Expanding on Jean's guideline (and perhaps the follow

Re: HBase Performance Improvements?

2012-05-10 Thread Bryan Beaudreault
I don't think there is. You need to have a table seeded with the right regions in order to run the bulk loader jobs. My machines are sufficiently fast that it did not take that long to sort. One thing I did do to speed this up was add a mapper to the job that generates the splits, which would c

Re: Switching existing table to Snappy possible?

2012-05-10 Thread Harsh J
Jeff, Are you looking for "hbase.online.schema.update.enable"? It's still considered experimental (though I dunno its current state on trunk) to update schemas on-line currently. See its description in http://hbase.apache.org/book.html, enable it and let us know! On Thu, May 10, 2012 at 9:32 PM, J
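
(A sketch of the on-line path for the Snappy change from this thread's subject, assuming hbase.online.schema.update.enable is set to true on the master; 0.92-era API, names hypothetical. Existing HFiles only become Snappy on the next major compaction.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class OnlineSnappy {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HColumnDescriptor cf = new HColumnDescriptor("cf");
        cf.setCompressionType(Compression.Algorithm.SNAPPY);
        admin.modifyColumn("mytable", cf); // no disableTable() with on-line updates enabled
        admin.majorCompact("mytable");     // rewrite existing HFiles with Snappy
      }
    }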

Re: Switching existing table to Snappy possible?

2012-05-10 Thread Amandeep Khurana
From what I understand, the online schema update feature (0.92.x onwards) would allow you to do this without disabling tables. It's experimental in 0.92. On Thu, May 10, 2012 at 9:02 AM, Jeff Whiting wrote: > We really need to be able to do this type of thing online. Taking your > table down j

Re: Switching existing table to Snappy possible?

2012-05-10 Thread Jean-Daniel Cryans
I've tested it extensively; it's safe to use if your tables aren't splitting like mad, e.g. don't try to do it during a massive import because you forgot to set the memstore size. J-D On Thu, May 10, 2012 at 9:11 AM, Amandeep Khurana wrote: > From what I understand, the online schema update feature

Re: Best way to profile a coprocessor

2012-05-10 Thread Gary Helmling
> What is the best way to profile some co-processor code (running on the > regionserver)? If you have done it successfully, what tips can you > offer, and what unexpected problems did you encounter? > It depends on what exactly you want to look at, but ultimately I don't think it's too different f
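
(One hedged example of the JVM-level route: since coprocessors run inside the regionserver process, hprof sampling can be turned on in hbase-env.sh; the flags and file path here are illustrative.)

    # hbase-env.sh on the regionserver to be profiled; restart it afterwards.
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -Xrunhprof:cpu=samples,interval=20,depth=10,file=/tmp/rs-hprof.txt"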

Re: Switching existing table to Snappy possible?

2012-05-10 Thread Jeff Whiting
Good to know, thanks for the feedback! On 5/10/2012 10:32 AM, Jean-Daniel Cryans wrote: extensively, it's safe to use if your tables aren't splitting like mad, e.g. don't try to do it during a massive import because you forgot to set the memstore size. -- Jeff Whiting Qualtrics Senior Software E

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Dave Revell
This "you don't need a reducer" conversation is distracting from the real problem and is false. Many mapreduce algorithms require a reduce phase (e.g. sorting). The fact that the output is written to HBase or somewhere else is irrelevant. -Dave On Thu, May 10, 2012 at 6:26 AM, Michael Segel wrot

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Michael Segel
Dave, do you really want to go there? The OP has a couple of issues and he was going down a rabbit hole. (You can choose whether that's a reference to 'The Matrix', Jefferson Starship, 'Alice in Wonderland'... or all of the above.) So to put him on the correct path, I recommended the following, not in any

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Dave Revell
Some examples of when you'd want a reducer: http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf On Thu, May 10, 2012 at 11:30 AM, Michael Segel wrote: > Dave, do you really want to go there? > > OP has a couple of issues and he was going down a rabbit hole. > (You can choose if t

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Michael Segel
Sigh. Dave, I really think you need to think more about the problem. Think about what a reduce does, and then think about what happens inside of HBase. Then think about which runs faster: a job with two mappers writing the intermediate and final results to HBase, or a M/R job that writes it

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Eran Kutner
Michael, I appreciate the feedback but I'd have to disagree. In my case, for example, I need to look at the complete set of data produced by the map phase in order to make a decision and write it to HBase. So sure, I could write all the mappers' output to HBase, then have another map-only job to scan the

Re: How to run two data nodes on one pc?

2012-05-10 Thread waled tayiib
Thank you very much, that is very useful. But for the run-additionalDN.sh file in this link http://search-hadoop.com/m/a4klk28NUr12, what did you do to fix it, since you got an error? I keep getting the error "shift: 24 : Cant shift that many". Also for the config files in this link

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Michael Segel
Eran, see my response inline... On May 10, 2012, at 2:17 PM, Eran Kutner wrote: > Michael, I appreciate the feedback but I'd have to disagree. > In my case, for example, I need to look at the complete set of data produced > by the map phase in order to make a decision and write it to HBase. So sure

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Stack
On Thu, May 10, 2012 at 11:59 AM, Michael Segel wrote: > Sigh. > > Dave, > I really think you need to think more about the problem. > > Think about what a reduce does and then think about what happens in side of > HBase. > > Then think about which runs faster... a job with two mappers writing the

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Michael Segel
Stack, That section was written by Doug after he and I had the same debate many moons ago. While I can't say with absolute certainty that you shouldn't use a reducer, what I can say is that in every situation I have seen where a M/R job writes to HBase, you end up not wanting to use a reduce

HBase bulk loaded region can't be split

2012-05-10 Thread Bruce Bian
I use importtsv to load data as HFile hadoop jar hbase-0.92.1.jar importtsv -Dimporttsv.bulk.output=/outputs/mytable.bulk -Dimporttsv.columns=HBASE_ROW_KEY,ns: -Dimporttsv.separator=, mytable /input Then I use completebulkload to load those bulk data into my table hadoop jar hbase-0.92.1.jar com
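
(The completebulkload step can also be driven from the 0.92 Java API; a sketch, with the paths mirroring Bruce's example.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class CompleteBulkLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Moves the generated HFiles into the table's regions, splitting any
        // HFile that straddles a region boundary.
        new LoadIncrementalHFiles(conf).doBulkLoad(
            new Path("/outputs/mytable.bulk"), new HTable(conf, "mytable"));
      }
    }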

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Michael Segel
Stack, Since you brought it up... > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink. "Writing, it may make sense to avoid the reduce step and write yourself back into HBase from inside your map. You'd do this when your job does not need the sort and

Re: HBase bulk loaded region can't be split

2012-05-10 Thread Bryan Beaudreault
I haven't done bulk loads using the importtsv tool, but I imagine it works similarly to the provided mapreduce bulk load tool. If so, the following stands: in order to do a bulk load you need to have a table ready to accept the data. The bulk load does not create regions, but only puts da

Re: HBase bulk loaded region can't be split

2012-05-10 Thread Bruce Bian
Yes, I understand that. But after I complete the bulk load, shouldn't it trigger the region server to split that region in order to meet the hbase.hregion.max.filesize criterion? When I try to split the region manually using the WebUI, nothing happens; instead a Region mytable,,13342
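
(A sketch of forcing the split from the API instead of the WebUI, 0.90/0.92-era HBaseAdmin; the table name mirrors Bruce's example. split() is asynchronous, so nothing is reported back to the caller.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ForceSplit {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // Asks every region of the table to split; a specific region name
        // can be passed instead to target one region.
        admin.split("mytable");
      }
    }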

Re: How to run two data nodes on one pc?

2012-05-10 Thread Harsh J
Waled, Since it's the same machine, using "localhost" (which automatically refers to 127.0.0.1) and those explicit port numbers should be fine. For the DN startup, see my last mail at the link I sent you for an alternative approach instead of using that script. On Thu, May 10, 2012 at 11:

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Stack
On Thu, May 10, 2012 at 6:28 PM, Michael Segel wrote: > That section was written by Doug after he and I had the same debate many moons > ago. I'm not sure that is correct. If you git blame that section, you'll see that stack and andrew are the authors and that the edits were made in 2009 and 20

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Stack
On Thu, May 10, 2012 at 7:46 PM, Michael Segel wrote: > "Writing, it may make sense to avoid the reduce step and write yourself back > into HBase from inside your map. You'd do this when your job does not need > the sort and collation that mapreduce does on the map emitted data; on > insert, HB

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Michael Segel
Hmmm. That could be. I don't know what Doug wrote, except that I know he mentioned he updated the docs on it. This is really kind of a basic issue; it just makes sense. As you already point out, you and Andrew noticed this back in 2009 and 2010. I just don't think you took it far

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Stack
On Thu, May 10, 2012 at 8:44 PM, Michael Segel wrote: > That could be. I don't know what Doug wrote except that I knew he mentioned > he updated the docs on it. > No worries. Can you make an issue and a patch on how you think we should reword the section? We can be stronger in our wording arou

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Stack
On Thu, May 10, 2012 at 1:17 AM, Eran Kutner wrote: > Here is an example of the HBase log (showing only errors): > > 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient: > DFSOutputStream ResponseProcessor exception for block > blk_-8928911185099340956_5189425 java.io.IOException: Bad re

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Stack
On Thu, May 10, 2012 at 4:33 AM, Eran Kutner wrote: > dfs.datanode.socket.write.timeout = 0 Not timing out is probably not what you want. On a problem, you want the client to give up rather than hang till the end of time (there are multiple replicas; hopefully you don't timeout on

Re: Occasional regionserver crashes following socket errors writing to HDFS

2012-05-10 Thread Stack
On Thu, May 10, 2012 at 6:26 AM, Michael Segel wrote: > 4) google dfs.balance.bandwidthPerSec. I believe it's also used by HBase when > they need to move regions. Nah. This is an HDFS setting. HBase doesn't use it directly. > Speaking of which, what happens when HBase decides to move a region? D