Re: Modeling column families

2010-04-24 Thread Erik Holstad
Hi Andrew!

So what I was thinking was just like Andrey is saying:
patient-code-date -> series:value

where the name of the column can be important, or just something like 1,
depending on whether
you have more than one value that you want to store for each entry.

Regards Erik


Re: Modeling column families

2010-04-23 Thread Erik Holstad
Hey Andrew!

The storage structure of your data could be pretty much the same whether you
choose HBase or Cassandra.
You can either do it your way or use
Bob-ABP-Timestamp, which will give you the advantage of scaling better, since
both HBase and
Cassandra split by rows.

HBase has the advantage of scanning much better than Cassandra and is
therefore more suitable
for the row oriented approach.

Erik


Re: Best way to do a clean update of a row

2010-03-08 Thread Erik Holstad
Hey Ferdy!
There has been a lot of talk about this lately. HBase has a resolution of
milliseconds, so
if you do a delete and a put in the same millisecond the put will not be shown.
There are a couple of solutions to this problem: waiting one millisecond
before doing the put,
setting the timestamps yourself, or doing some kind of swap between two
rows.

Erik


Re: Best way to do a clean update of a row

2010-03-08 Thread Erik Holstad
Hey Ferdy!
Not really sure what you are asking now. But if you do a deleteRow and then
a put in the same
millisecond, the put will be shadowed by the delete so that it will not
show up when you look
for it later, if that makes sense? The reason for this is that deletes are
sorted before puts for the
same timestamp, so for a put to be viewable it needs to have a newer
timestamp than the delete.


-- 
Regards Erik


Re: Best way to do a clean update of a row

2010-03-08 Thread Erik Holstad
Hey Ferdy!

On Mon, Mar 8, 2010 at 8:45 AM, Ferdy ferdy.gal...@kalooga.com wrote:

 Hey,

 Great! That is exactly what I meant. So that implies that firing a Delete
 and a Put right after each other is a pretty bad practice, if you want the
 Put to persist. Please note, I only need one version. (All my families are
  VERSIONS => '1').

 I guess I have the following choice of solutions:

 // Solution A: Issue a client-side pause
 htable.delete(delete);
 try {Thread.sleep(10);} catch (InterruptedException e) {}
 htable.put(put);

 But wait, the javadoc for Delete states that if no timestamp is specified,
 the SERVER will use the 'now' time. This means that the Delete and the
 Put can still be determined to have the same timestamp.

Not really sure why they would still get the same timestamp if you wait 10
millis on the client; it should be the same resolution on the server, right?



 // Solution B: specify timestamps
 long deleteTS = System.currentTimeMillis();
 long putTS = deleteTS+1;
 Delete delete = new Delete(row, deleteTS, null);
 htable.delete(delete);
 Put put = new Put(row);
 put.add(family, column, putTS, value);
 htable.put(put);

 How about this solution? I'm guessing the only disadvantage to this one is:
 data written by a client machine with an incorrectly set system time (let's
 say a few days ahead) will not be able to be removed by another machine (with
 a correct system time) shortly after, because the deleteTS of the correct
 client will be smaller than the timestamp in the table.

This is the reason that it might be tricky to use your own client timestamps,
and it makes server-side setting of timestamps a better option.

But it seems like you have a good understanding of the consequences, so good
luck!



 Regards,
 Ferdy


 Erik Holstad wrote:

 Hey Ferdy!
 Not really sure what you are asking now. But if you do a deleteRow and
 then
 a put in the same
 millisecond, the put will be shadowed by the delete so that it will not
 show up when you look
 for it later, if that makes sense? The reason for this is that deletes are
 sorted before puts for the
 same timestamp, so for a put to be viewable it needs to have a newer
 timestamp than the delete.







-- 
Regards Erik


Re: Does the scanner interface have a previous method?

2010-03-06 Thread Erik Holstad
Hey Hua!
I agree that this would be a nice-to-have feature for cases when you want to
fetch data in
both ascending and descending order. That way you wouldn't need to store any
extra indices, you could just open up
a different scanner.
The best way to do this now though is to store the data in a second, reversed index.

Good luck!
Erik


Re: Performance with many KV's per row

2010-03-04 Thread Erik Holstad
Hey Al!
There are indexes to the files that you are looking in, but as far as I know
not to all rows,
which means that you have to scan a few KeyValues before getting to the
one you want.

The BloomFilter would only help in the case of a Get operation; for a Scan
you still need
to open up all files. The only thing that filters, or specifying a timestamp
for example, do
is let you return faster, the approach is still the same.

Hope that helps

-- 
Regards Erik


Re: Multiple get

2010-03-04 Thread Erik Holstad
Hey Slava!
There is work being done to be able to do this
https://issues.apache.org/jira/browse/HBASE-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832114#action_12832114
I think that there is also another Jira that is related to this topic, but I
don't know what that number is.

-- 
Regards Erik


Re: Timestamp of specific row and column

2010-03-03 Thread Erik Holstad
Hey Slava!
I actually think that iterating the list is going to be faster, since you
don't have to create
the map first, but that kinda depends on how you are planning to use the
data afterwards,
so please let me know if you get a different result.
The reason that there isn't a more elegant way of getting the timestamp is
that we
wanted to make everything as bare bones as possible to improve speed.
Creating the map from the list on the client side and getting your
timestamps is still
going to be about 10-100x faster than with your old code, hopefully :)
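
A minimal sketch of pulling the timestamps straight out of the returned list
(assuming the 0.20 client API; the table, family and qualifier names here are
just placeholders):

Get get = new Get(Bytes.toBytes("myRow"));
get.addColumn(Bytes.toBytes("myFamily"), Bytes.toBytes("myQualifier"));
Result result = table.get(get);
// Walk the raw KeyValues instead of building the map first
for (KeyValue kv : result.list()) {
  long ts = kv.getTimestamp();
  byte[] value = kv.getValue();
  // use ts and value here
}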

-- 
Regards Erik


Re: Questions about HBase

2010-03-01 Thread Erik Holstad
Hey William!

On Mon, Mar 1, 2010 at 12:36 PM, William Kang weliam.cl...@gmail.com wrote:

 Hi guys,
 I am new to HBase and have several questions. Would anybody kindly answer
 some of them?

 1. Why HBase could provide a low-latency random access to files compared to
 HDFS?

Have a look at http://wiki.apache.org/hadoop/Hbase and the bigtable paper
for reference on how
to add random access on top of a dfs.



 2. By default, Only a single row at a time may be locked. Is it a single
 client who can only lock one or is it globally can only lock one?  If this
 is the case, by default, will the performance be really bad?

The lock is global to ensure that, for example, a read/update/put action can
take
place. Performance will suffer if you have a lot of clients doing this to
the same
row at the same time. In my experience this is not really that common, but if
this
is something you need, have a look at ITHBase.


 Many thanks!


 William




-- 
Regards Erik


Re: Questions about HBase

2010-03-01 Thread Erik Holstad
On Mon, Mar 1, 2010 at 2:16 PM, Ryan Rawson ryano...@gmail.com wrote:

 Hi,

 1.  We use in-memory indexes to get fast random reads.  Our index
 tells us to read block X of a file only retrieving a small amount of
 the file to satisfy the user's read.

 2.  The row locking is not global - for each row there can only be 1
 thread doing a put at a time.  This serializes all puts to a single
 row. It is NOT global.


So, this is me not understanding the difference between global locking and
waiting in line until the tasks ahead of you are done :)


 On Mon, Mar 1, 2010 at 12:36 PM, William Kang weliam.cl...@gmail.com
 wrote:
  Hi guys,
  I am new to HBase and have several questions. Would anybody kindly answer
  some of them?
 
  1. Why HBase could provide a low-latency random access to files compared
 to
  HDFS?
 
  2. By default, Only a single row at a time may be locked. Is it a single
  client who can only lock one or is it globally can only lock one?  If
 this
  is the case, by default, will the performance be really bad?
 
  Many thanks!
 
 
  William
 




-- 
Regards Erik


Re: Row with many columns

2010-02-21 Thread Erik Holstad
Hey Ruslan!
Both HBase and Cassandra are, as you now know, row partitioned, so to get the
most out
of either of them I think you should try to change the structure of your data,
so that you, for
example, have one object per row, like Ted suggested, or in some other way.


-- 
Regards Erik


Re: Thrift api and binary keys

2010-02-09 Thread Erik Holstad
Hey Saptarshi!
I don't know too much about the current status of Thrift, but Text can hold a
byte[]; it is a Hadoop class
which is like a fancier version of a String.
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/io/Text.html

-- 
Regards Erik


Re: Scanner API Question

2009-12-07 Thread Erik Holstad
Hey Edward!

s.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("anchor"));
this looks for anchor:anchor, which I don't see

s.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("anchor:Alverta Angstrom
cathodegraph"));
this looks for anchor:anchor:Alverta Angstrom cathodegraph, which I don't
see

s.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("Alverta Angstrom
cathodegraph"));
this looks for anchor:Alverta Angstrom cathodegraph; this should work, not
sure why it isn't.
I would recommend checking the KeyValue being returned from the family scan
and comparing it
to what you are adding to your scanner; there might be some weirdness with the
spaces.

Regards Erik


Re: Error running contrib tests

2009-11-24 Thread Erik Holstad
Yeah, I like that approach better.

Will do.

Erik


Re: Error running contrib tests

2009-11-23 Thread Erik Holstad
Sorry that it took so long, but I ran into trouble when trying out trunk
instead of 3.2.1.
It has nothing to do with the ZooKeeper code, I think, but more with my
environment.
After switching to trunk, I got back the original error and have been
struggling with it
since.
I'm running on Fedora 9.
So, to make sure that it works, the line /usr/local/lib/ needs to be in
/etc/ld.so.conf.
If it is but it is still not working, you have to run /sbin/ldconfig.

What I previously wrote about the different versions of Python doesn't seem
to apply
any more; it works fine for Python 2.5.

On my laptop, where I run Ubuntu 9.04, I don't have this problem.

Regards Erik


Re: Error running contrib tests

2009-11-23 Thread Erik Holstad
I have been messing around trying to get it included in the setup.py script,
but I'm not sure
if it is good to have it there, since writing to /etc/ld.so.conf and
rerunning /sbin/ldconfig
require sudo status. But on the other hand that is needed to install from
the beginning.

I could add something like this to the end of the setup file

print "Setting up dependencies"
try:
  f = open("/etc/ld.so.conf", 'r+')
  content = f.read()
  path = "/usr/local/lib"
  if content.find(path) == -1:
    f.write(path)

except Exception, e:
  pass

import subprocess
subprocess.call(['sudo', '/sbin/ldconfig'])

Erik


Re: Error running contrib tests

2009-11-20 Thread Erik Holstad
Hey!
I have been working with the zkpython module for the last couple of weeks.
After the initial problem with


ImportError: libzookeeper_mt.so.2: cannot open shared object file: No such
file or directory

I just added LD_LIBRARY_PATH=/usr/local/lib to the .bashrc file and
everything worked fine. For PyDev
I had to add it as an environment variable to make it work.

But when moving away from raw Python to using it in a fcgi context, I was
not able to get it to work under any
circumstances.

We were running this using Python 2.5, which seemed to be one of the
problems; the other one that I found
was the build script in zookeeper_home/src/contrib/zkpython.
When running sudo ant install everything works fine,
but when trying to import it in Python 2.6 I still got the same error.

After changing the setup.py file to point to the correct libraries and running
sudo python setup.py install, everything works
fine and no more errors.

Just wanted to share this with the community since I've spent quite some
time trying to get this to work.

Regards Erik


Re: Image and Video data in HBASE

2009-11-18 Thread Erik Holstad
Hi Bharath!
You can have a look at FileOutputStream; that might be a good match for you.

Regards Erik


Re: Suggestion: Result.getTimestamp

2009-10-26 Thread Erik Holstad
Hey Doug!
Looking at the code, this doesn't seem to be very hard at all to do, since
the timestamp and the value form a map entry, and for
getValue() we just return the value and not the key (the timestamp).

Please file a Jira.

Regards Erik


Re: Modified grades example not working for me in 0.20 API

2009-10-01 Thread Erik Holstad
Hey Terryg!

If you look at the createSubmittableJob code you can see that
you are setting:

job.setMapOutputKeyClass(ImmutableBytesWritable.class);

and in the map and reduce you are using Text; change them so that they match
and you should be good to go.
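
For example, one hedged fix (assuming your mapper really does emit Text keys
and values) is simply to declare that in the job setup instead:

// Make the declared types match what the map() method actually emits
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

The other direction, changing the mapper to emit ImmutableBytesWritable, works
just as well; the only requirement is that the two agree.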

Regards Erik


Re: About HBase Files

2009-09-22 Thread Erik Holstad
Hey Stchu!

I'm not exactly sure what the messy code is, except that it looks like
non-printable binary data. Depending on
where you look, I think it is values, offsets etc.

The reason that we are keeping the family stored in the files is to leave
the door open for something called
locality groups. There was a lot of talk about keeping them or not when
doing the big 0.20 rewrite, but we decided
to leave them in. When using compression they add very little overhead and
make it easy to send out results
without having to copy anything.

Regards Erik


Error running contrib tests

2009-09-22 Thread Erik Holstad
Hi!
I am trying out the python bindings and I followed the guide on
http://www.cloudera.com/blog/2009/05/28/building-a-distributed-concurrent-queue-with-apache-zookeeper/
Everything worked fine until the last step:

Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:56)
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import zookeeper
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: libzookeeper_mt.so.2: cannot open shared object file: No such
file or directory

I figured that I did something wrong in my setup, so I tried to run the
contrib test and got:

python-test:
 [exec] Running src/test/clientid_test.py
 [exec] Traceback (most recent call last):
 [exec]   File "src/test/clientid_test.py", line 21, in <module>
 [exec] import zookeeper, zktestbase
 [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
file: No such file or directory
 [exec] Running src/test/connection_test.py
 [exec] Traceback (most recent call last):
 [exec]   File "src/test/connection_test.py", line 21, in <module>
 [exec] import zookeeper, zktestbase
 [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
file: No such file or directory
 [exec] Running src/test/create_test.py
 [exec] Traceback (most recent call last):
 [exec]   File "src/test/create_test.py", line 19, in <module>
 [exec] import zookeeper, zktestbase, unittest, threading
 [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
file: No such file or directory
 [exec] Running src/test/delete_test.py
 [exec] Traceback (most recent call last):
 [exec]   File "src/test/delete_test.py", line 19, in <module>
 [exec] import zookeeper, zktestbase, unittest, threading
 [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
file: No such file or directory
 [exec] Running src/test/exists_test.py
 [exec] Traceback (most recent call last):
 [exec]   File "src/test/exists_test.py", line 19, in <module>
 [exec] import zookeeper, zktestbase, unittest, threading
 [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
file: No such file or directory
 [exec] Running src/test/get_set_test.py
 [exec] Traceback (most recent call last):
 [exec]   File "src/test/get_set_test.py", line 19, in <module>
 [exec] import zookeeper, zktestbase, unittest, threading
 [exec] ImportError: libzookeeper_mt.so.2: cannot open shared object
file: No such file or directory

BUILD FAILED
/home/erik/src/zookeeper-3.2.1/src/contrib/build.xml:48: The following error
occurred while executing this line:
/home/erik/src/zookeeper-3.2.1/src/contrib/zkpython/build.xml:63: exec
returned: 1


I ran this test from zookeeper/src/contrib with ant test

Not sure if I'm doing something wrong or if this is a bug?

Regards Erik


Re: about java.lang.NullPointerException

2009-09-18 Thread Erik Holstad
Hi Yan!
I see that you are using a 0.19 or older version of HBase. Do you have a chance
to upgrade to 0.20?
HBase 0.20 is highly recommended, with a lot of improvements compared to the
older versions.
If not, maybe you can supply your map code or point to the line where this
exception is thrown?

Regards Erik


Re: Hbase client program error

2009-08-31 Thread Erik Holstad
Hi Charles!
Looking at the stack trace and going to
org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:89),
it seems like
the conf file you are using does not set the hbase.rootdir as described in
http://hadoop.apache.org/hbase/docs/current/api/index.html, can that be the
case?

Regards Erik


Re: 0.20 RC1

2009-08-12 Thread Erik Holstad
Hi Mike!
There is currently one bug being worked on that is holding up RC2,
but I think that it will be taken care
of in the next couple of days, hopefully today or tomorrow, and then we are
going to put up the next RC, which
hopefully will stick.

Regards Erik


Re: Map\Reduce for one row

2009-08-12 Thread Erik Holstad
Hi Fernando!
In HBase 0.20 you instantiate a MR job with a Scan object where you can set
your start and stop row.
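
A rough sketch of that setup (using the 0.20 org.apache.hadoop.hbase.mapreduce
API; table, family and row names are made up, MyMapper stands for your own
TableMapper, and the rest of the job setup is left out of the sketch):

// Note that the stop row is exclusive, so for a single row the stop key
// should be the row key plus one trailing zero byte.
Scan scan = new Scan(Bytes.toBytes("startRow"), Bytes.toBytes("stopRow"));
scan.addFamily(Bytes.toBytes("myFamily"));
Job job = new Job(conf, "scan-one-row-range");
TableMapReduceUtil.initTableMapperJob("myTable", scan, MyMapper.class,
    ImmutableBytesWritable.class, Result.class, job);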

Regards Erik


Re: Doubt regarding mem cache.

2009-08-12 Thread Erik Holstad
Hi Rakhi!

On Wed, Aug 12, 2009 at 11:49 AM, Rakhi Khatwani rkhatw...@gmail.com wrote:

 Hi,
 I am not very clear as to how does the mem cache thing works.


MemCache was a name that was used and it caused some confusion about what the
purpose of it is.
It has now been renamed to MemStore and is basically a write buffer that
gets flushed to disk/HDFS
when it gets too big.


 1. When you set memcache to say 1MB, does hbase write all the table
 information into some cache memory and when the size reaches IMB, it writes
 into hadoop and after that the replication takes place???

So yeah, kinda


 2. Is there any minimum limit on the mem cache property in hbase site??

Not sure if there is a minimum limit, but have a look at hbase-default.xml
and
you will see a bunch of settings for the MemStore.


 Regards,
 Raakhi


Regards Erik


Re: HBase commit autoflush

2009-08-12 Thread Erik Holstad
Hey Schubert!
The writeBuffer is sorted in processBatchOfRows just like you suggested.

Regards Erik


Re: Problem getting scheduler to work.

2009-08-04 Thread Erik Holstad
Just to let people who encounter this in the future know what to do.

After recommendations from Matei I changed the

<name>mapred.fairscheduler.poolnameproperty</name>
from
<value>mapred.job.queue.name</value>
to
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>

and set the conf accordingly and now it works fine.
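
For reference, with the poolnameproperty pointed at pool.name as above, a job
picks its pool with a single line on its Configuration (the pool name being
whatever you defined in pools.xml):

conf.set("pool.name", "fast");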

Thanks Matei for the help!

Erik


Problem getting scheduler to work.

2009-08-03 Thread Erik Holstad
Hi!
I'm testing out the FairScheduler and I'm getting it to start and the Pools
that I've defined in the pools.xml file shows up and everything.
But when trying to submit a job, I don't really know where to put the name
of the pool to use for the job. All the examples that I've seen are
using JobConf and I'm currently on 0.20. I tried to put the name on the
Configuration like:

conf.set("mapred.job.queue.name", "fast");
but just getting

org.apache.hadoop.ipc.RemoteException: java.io.IOException: Queue "fast"
does not exist

So, how and where do I set the pool to use for the individual jobs?
Erik


Re: HBase schema design

2009-08-01 Thread Erik Holstad
Hey!
Two things. First, if you are only looking for data from one family, you
should try
to put that into the request; it might not make a difference in your case if
you only have one family.
The other thing is how you are looking at the result. In most cases, in my
opinion, the best way to look at the result is to manually go through the
List<KeyValue> in the result and not make it into a map and then ask for
the specific family+qualifier, but that is something you can test and see
which one is faster.
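
As a rough sketch of the two options (0.20 client API assumed, family and
qualifier names made up):

Result result = table.get(get);

// Option 1: walk the raw list of KeyValues
for (KeyValue kv : result.list()) {
  byte[] qualifier = kv.getQualifier();
  byte[] value = kv.getValue();
  // handle qualifier/value here
}

// Option 2: let Result build its map and ask for one specific cell
byte[] single = result.getValue(Bytes.toBytes("myFamily"),
    Bytes.toBytes("myQualifier"));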

Erik


Re: org.apache.hadoop.hbase.client.RetriesExhaustedException

2009-07-22 Thread Erik Holstad
Hi Stchu!
To me it looks like you are overloading the system and that your server goes
down or becomes unreachable.
Is this just for testing? In that case maybe you can make the test
smaller, or even better make your cluster
bigger, if you have that option.

Regards Erik


Re: Data Processing in hbase

2009-07-22 Thread Erik Holstad
Hi Bharath!
One of the main benefits of using HBase is that it gives you random access
to your data. The main goal is not to
use it for big batch processing jobs going through all or a lot of your
data, even though the hooks into MapReduce
give you that option.

So whenever you fetch data using get and scan, that data is brought to the
client for you to process it there. When using
HBase as the source or sink of an MR job this is not the case.

What access patterns do you have for your data: are you doing a lot of random
reads or mostly batch processing of
data?

Regards Erik


Re: TableMap.initJob() function

2009-07-22 Thread Erik Holstad
Hi Bharath!
Yeah, that is what Jonathan means. If you need data from 2 tables in your
mapper, you can have one as the standard
input and in your mapper make explicit calls to HBase to request data from
the other table into your mapper, just like you did in
your example.

Yes, you can output as many pairs as you want from every mapper, it doesn't
have to be a one to one ratio.
I'm not really sure what you want to do with your data: do you want to compute
some output using data from the two tables,
or simply do the same thing for both tables?

Regards Erik


Re: TableMap.initJob() function

2009-07-22 Thread Erik Holstad
Hey!
Yeah, that makes sense, just wanted to check, because then you don't really
have to output two things from the mapper, right?
You can just do something like:

Map(t1_rowkey, t1_rowresult) {
  for (Map.Entry<byte[], Cell> e : t1_rowresult.entrySet()) {
    t2_rowresult = table2.get(e.getValue().getValue(),
        "fam:column_to_fetch");
    // Process data from table1 and table2

    // output to reduce
    output.collect(newKey, new Data());
  }
}

Good luck!
Regards Erik


Re: Trouble in running HBase MR jobs

2009-07-22 Thread Erik Holstad
Hi Bharath!
Did you have a look at http://your_machine:50030/jobtracker.jsp
Should give you the logs, or output that you are looking for.

Regards Erik


Re: org.apache.hadoop.hbase.client.RetriesExhaustedException

2009-07-22 Thread Erik Holstad
You are welcome!
Good luck and let us know if you have some more issues.

Regards Erik


Re: Order of keys - custom comparator?

2009-07-20 Thread Erik Holstad
Hey Saptarshi!
I'm not sure what you are trying to do with the order -5 < rowKey < 5,
but it is not possible to send in a
comparator when scanning, since everything has already been put into HBase.

And you are right that putting in ints as row keys doesn't keep the order
you are looking for since there is a sign bit.
But what you can do is to add an extra layer on your client that shifts all
your row keys or just use the positive ones.
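
A minimal sketch of such a client-side shift (just flipping the sign bit
before the int goes into Bytes.toBytes, so the unsigned byte order HBase uses
matches the signed order of the original ints):

// -5 becomes 0x7FFFFFFB and 5 becomes 0x80000005, so the byte-wise
// comparison HBase does now sorts them the way the signed ints would sort.
byte[] rowKey = Bytes.toBytes(myInt ^ Integer.MIN_VALUE);
// and to get the original value back out of a row key:
int original = Bytes.toInt(rowKey) ^ Integer.MIN_VALUE;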

Regards Erik


Re: Column names of a table in Hbase

2009-07-19 Thread Erik Holstad
Hi Bharath!
Depending on whether you want to get the family names only or the combination
family + qualifier, there are different ways of doing this.
If you just want the family names, you can do as Tim suggested:
getTableDescriptor().getFamilies().
But if you want to get all the columns (family + qualifier) there is no other
way of doing this at the moment but to do a full scan of that table.
We have been talking about adding some extra counter calls; we might be able to
work this call in there too.
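
A hedged sketch of the full-scan approach (0.20 client API, table name made
up), collecting the distinct family:qualifier pairs:

Set<String> columns = new TreeSet<String>();
ResultScanner scanner = table.getScanner(new Scan());
try {
  for (Result result : scanner) {
    for (KeyValue kv : result.list()) {
      columns.add(Bytes.toString(kv.getFamily()) + ":" +
          Bytes.toString(kv.getQualifier()));
    }
  }
} finally {
  scanner.close();
}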

Regards Erik


Re: hbase / hadoop 020 compatability error

2009-07-14 Thread Erik Holstad
Hey Yair!
Yeah, I think there have been some updates since the alpha. I'm looking at
trunk and it looks like:

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/**
 * Convert Map/Reduce output and write it to an HBase table. The KEY is
 * ignored while the output value <u>must</u> be either a {@link Put} or a
 * {@link Delete} instance.
 *
 * @param <KEY>  The type of the key. Ignored in this class.
 */
public class TableOutputFormat<KEY> extends OutputFormat<KEY, Writable> {

Regards Erik


Re: nightly builds

2009-07-14 Thread Erik Holstad
Hi Fernando!
The last thing I heard was that it was better to download the source and
build it yourself, since there was some extra
stuff included in the Hudson build.

Haven't heard any ETA for 0.20, but it should be pretty soon.

Regards Erik


Re: Question about the sequential flag on create.

2009-07-14 Thread Erik Holstad
Hey Patrick!
Thanks for the reply.
I understand all the reasons that you posted above and totally agree that
nodes should not be sorted since you then have to pay that overhead for
every node, even though you might not need or want it.
I just thought that it might be possible to create a sequential node
atomically, but I guess that is not how it works?

Regards Erik


Re: Question about the sequential flag on create.

2009-07-14 Thread Erik Holstad
Thanks Patrick!


Instantiating HashSet for DataNode?

2009-07-14 Thread Erik Holstad
I'm not sure if I've misread the code for the DataNode, but to me it looks
like every node gets a set of children, even though it might be an
ephemeral node which cannot have children, so we are wasting 240 B for every
one of those. Not sure if it makes a big difference, but just thinking
that since everything sits in memory and there is no reason to instantiate
it, maybe it would be possible just to add a check in the constructor?

Regards Erik


Re: Instantiating HashSet for DataNode?

2009-07-14 Thread Erik Holstad
Will file Jira and have a stab at it when I get a little time.

Thanks guys!


Re: Replay of hlog required, but none performed

2009-07-13 Thread Erik Holstad
Hi Joel!
Try to have a look at question number 6 on :
http://wiki.apache.org/hadoop/Hbase/FAQ

might be the cause of your problems.

Regards Erik


Re: help needed with base schema

2009-07-13 Thread Erik Holstad
Hi Piyush!
First I have to ask what version of HBase you are on.
For the new HBase 0.20 we have made some major rewrites of the internal
structure to get much greater read and
scan speeds.

Depending a little on whether you store your updates as versions or use the
qualifier as a timestamp, we have two different ways
to query; you might know this already. You can either set the number of
versions that you wish to get returned or use a filter.
Regards Erik


Question about the sequential flag on create.

2009-07-13 Thread Erik Holstad
Hey!
I have been playing around with the queue and barrier example found on the
home page and have some questions about the code.
First of all I had trouble getting the queue example to work, since the code
turns the sequence number into an int and then tries to get information
from it, missing out the padding, which caused some confusion at first. So I
changed it to compare the strings themselves so you don't have to
add the padding back on.

The fact that you have to sort the children every time you get them is a
little bit confusing to me; does anyone have a simple answer to why that is?

Regards Erik


Re: Question about the sequential flag on create.

2009-07-13 Thread Erik Holstad
Hi Mahadev!
Thanks for the quick reply. Yeah, I saw that in the source, but was just
curious why that is, since it is a part of an internal
counter structure, right?

Regards Erik


Re: Unsubscribe me

2009-07-10 Thread Erik Holstad
Hey Aaron!
Did you try sending a mail to hbase-user-unsubscr...@hadoop.apache.org?

Regards Erik

On Fri, Jul 10, 2009 at 5:31 PM, Aaron Crow dirtyvagab...@yahoo.com wrote:

 Please unsubscribe me



Re: Adding a row to an hbase table

2009-07-10 Thread Erik Holstad
You have probably managed to create your rows by now, but just wanted to say
that the only things you need to specify or create in advance,
before putting data into HBase, are the families for the table. When that is
done you can just start adding data to it.
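
A minimal sketch (assuming the 0.20 client API; table, family, qualifier and
row names are placeholders): create the table with its families once, then
start putting rows without any further setup:

HBaseConfiguration conf = new HBaseConfiguration();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("myTable");
desc.addFamily(new HColumnDescriptor(Bytes.toBytes("myFamily")));
admin.createTable(desc);

// Rows and qualifiers need no predefinition, just put them
HTable table = new HTable(conf, "myTable");
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("myFamily"), Bytes.toBytes("someQualifier"),
    Bytes.toBytes("someValue"));
table.put(put);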

Regards Erik


Re: SVN not working

2009-07-02 Thread Erik Holstad
Hey Bharath!
It seems like that site is up, but I would recommend

http://svn.apache.org/repos/asf/hadoop/hbase/trunk if you want trunk,
or the latest release, 0.19.3.

Just have to add that the upcoming 0.20 release is going to have some
serious improvements in speed
and a new API.

Regards Erik


Re: Map Reduce performance

2009-06-24 Thread Erik Holstad
Hi Ramesh!
Have to agree with Tim about the size of your cluster; I'm honestly a little
bit surprised that you are actually seeing
that using MR on a single node is faster, since you only get the negative
sides, setup and so on, from it, but not
the good stuff.
I looked at the code and it looks good, not really doing too much in the Job,
and it doesn't look like you are doing
anything wrong. I do have some things you can think about though when you
get a bigger cluster up and running.
1. You might want to stay away from creating Text objects; we are internally
trying to move away from all usage of Text in HBase and just use
ImmutableBytesWritable or something like that.
2. Getting an HTable is expensive, so you might want to create a pool of
those connections that you can share so you don't have to get a new one for
every task. I'm not 100% sure about the configure call, but I think it gives
you one per call; might be worth looking into.
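
A very rough sketch of such a pool (plain Java around HTable; sizing and error
handling are left out, and the table name is a placeholder):

// A tiny shared pool so tasks can reuse HTable instances instead of
// creating a new one every time.
private static final LinkedBlockingQueue<HTable> POOL =
    new LinkedBlockingQueue<HTable>();

static HTable borrowTable(HBaseConfiguration conf) throws IOException {
  HTable table = POOL.poll();
  return table != null ? table : new HTable(conf, "myTable");
}

static void returnTable(HTable table) {
  POOL.offer(table);
}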

Erik


Re: Running programs under HBase 0.20.0 alpha

2009-06-22 Thread Erik Holstad
Hi Ilpind.

The jar that you are running, does it have access to the zoo.cfg? If not,
you probably need to
add it to the jar or to your classpath.

Erik


Re: Can't list table that exists inside HBase

2009-06-17 Thread Erik Holstad
Hi Lucas!
Just a quick thought: do you have a lot of data in your cluster or just a
few things in there?
If you don't have that much data in HBase it might not have been flushed to
disk/HDFS yet
and therefore only sits in the internal memcache in HBase, so when your
machines are turned
off, that data is lost.

Regards Erik


Re: Can't list table that exists inside HBase

2009-06-17 Thread Erik Holstad
Hi Lucas!
Yeah, have a look at HBaseAdmin and you will find flush and compact. I'm not
sure that compact is going to make
a big difference in your case, since you only have one flush or so per day,
but it might be nice for you to run it too.
Running a compaction means that all your flushed files will be rewritten into
one single file.

Regards Erik


Re: Can't list table that exists inside HBase

2009-06-17 Thread Erik Holstad
Hi Lucas!
Not sure if you have had a look at the BigTable paper; the link at the beginning
of http://hadoop.apache.org/hbase/ might clear up some of the confusion.
But basically what happens is that, to support fast writes, we only write to
memory and periodically flush this data to disk, so while data is still in
memory
it is not persisted; it needs to be written to disk/HDFS for that to be true.
We have a second mechanism for dealing with not losing data while it is sitting
in memory. This is called the WriteAheadLog, and we are still waiting for Hadoop
to support one of the features to make this happen, which hopefully will
not be too long.

Hope this helped.

Erik


Re: Row filters

2009-06-12 Thread Erik Holstad
Hi Piotr!
Yes, HBase stores data in a sorted fashion; rows with similar row keys
will be stored close to
each other. So if you want to only scan rows in a specific range and use
that as the input for your
map-reduce job, this is possible in two ways. You can either use a filter
that filters out the rows not
included in this range, or, as soon as we get 0.20 out, you can specify the
start and the stop row in the
Scan object. So when you are out of that range you can safely stop scanning;
I'm not really sure how this
was done in 0.19 and earlier.

Hope that this can help.

Regards Erik


Re: for one specific row: are the values of all columns of one family stored in one physical/grid node?

2009-06-11 Thread Erik Holstad
Hi!
Just to be clear, what is being said here is that every region contains a set
of stores, each of which holds
one family, for that specific row range. And one store can hold many
files with data for that
store, which in the case of a major compaction are turned into one single file.

Erik


Re: Regarding HBase

2009-06-11 Thread Erik Holstad
Hi!
I'm not really sure what kind of problems you have encountered using
Hadoop/HDFS/MapReduce and what you are trying
to solve by using HBase.

For what HBase offers you, see the first two rows of text on
http://hadoop.apache.org/hbase/

If you could provide us with a little bit more information about the
problems that you have, it might be easier for us to help.

Regards Erik


Re: Frequent changing rowkey - HBase insert

2009-06-08 Thread Erik Holstad
Hi Ilpind!

On Mon, Jun 8, 2009 at 8:45 AM, llpind sonny_h...@hotmail.com wrote:


 The insert works well for when I have a row key which is constant for a
 long
 period of time, and I can split it up into blocks.  But when the row key
 changes often, then insert performance over time starts to suffer.  The
 suggestion made by Ryan does help, and I was eventually able to get the
 entire data set into HBase. ( ~120 Million records)

 Currently working on some analysis, and had a question about the java api.
 Is there a way to get record count given a row key?  something like: long
 getColumnCount (rowkey).  So it doesn't bring down any data to client, but
 simply returns the size..?


We have been talking about something similar to this for scanners. A call
that
just counts the number of rows between a start and a stop row and doesn't
return
any data.
So that would make it 4 calls if I'm not mistaken :
countRows(Scan scan)
countFamilies(List<byte[]> families)
countQualifiers(byte [] family)
countVersions(byte[] family, byte[] qualifier, long minTime, long maxTime)

or maybe just keep it simple and use.:
countRows(Scan scan)
countFamilies(Get get)
countQualifiers(Get get)
countVersions(Get get)

We talked about just having a special serializer that doesn't return any
data just the count.

How does that sound to you?

Erik


Re: Question regarding MR for Hbase

2009-06-04 Thread Erik Holstad
Hey Vijay!
You can have a look at
http://wiki.apache.org/hadoop/Hbase/MapReduce
That might make things easier to understand, just remember that the new API
for 0.20 will look different,
but the concept will be the same

Erik


Re: Uses cases for checkAndSave?

2009-06-03 Thread Erik Holstad
@Ryan
I kind of agree with you in general about the interface put I think 3 input
classes where 2 has to match is not going to be very easy to deal with.
Maybe we should have checkAndPut(ListKeyValues kvs, Put put) where the
list of KeyValues are the expected keys+values.

@Guilherme
I pushed a version of checkAndPut() to HBASE-1304 so you are more than
welcome to have a look at it, test it and come with feedback when you have
time.

Erik


Re: Clarifying the role of HBase Versions

2009-06-02 Thread Erik Holstad
Hi Ryan!

In previous versions of HBase, when dealing with querying versions, they were
stored in a TreeMap, which added complexity and made the query somewhat
slower. With 0.20, data returned to the client is just an array of KeyValues,
which is the new storage format. When it comes down to splits, regions are
split by rows, so it doesn't really matter if you have many qualifiers or
versions in that case. When it comes to compactions there should be no
difference either compared to qualifiers.

If we didn't use the timestamp as the major key for versions, what do you have
in mind? You can set your own timestamp client-side if you wish, but I must
warn you that this might give you unexpected results if you don't fully
understand how to use this feature.

Regards Erik


Uses cases for checkAndSave?

2009-06-02 Thread Erik Holstad
Hi!
I'm working on putting checkAndSave back into 0.20 and just want to check
with the people that are using it how they are using it,
so that I can make it as good as possible for these users.

Since the API has changed from earlier versions there are some things that
one needs to think about.
In the new API there are no Updates, just Put and Delete, so for
now I need to know if users used to delete in the old batchUpdate
or just put.

The new return format Result might seem like a good way to send in the data
to be used as the actual values, but there is no super easy way to build that
on the client side for now, so it would be good to know how you are doing this.
Do you do a get, save the result and then use it for the check, or do you
just create new structures on the client?

Regards Erik


Re: Uses cases for checkAndSave?

2009-06-02 Thread Erik Holstad
Hi!

On Tue, Jun 2, 2009 at 11:17 AM, Guilherme Germoglio germog...@gmail.com wrote:

 Hi Erik,

 For now, I'm using checkAndSave in order to make sure that a row is only
 created but not overwritten by multiple threads. So, checkAndSave is mostly
 invoked with a new structure created on the client. Actually, I'm checking
 if a specific deleted column is empty. If the deleted column is not
 empty, then the row creation cannot be performed. There are a few other
 tricky cases I'm using it in, but I'm sure that making that Result object more
 difficult to create than putting values on a map would be bad for me. :-)

So you have a row with a family and qualifier that you check to see if it is
empty,
and if it is you insert a new row? So basically you use it as an atomic
rowExists
checker? Are you usually batching these checks or would it be ok with
something like:

public boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier,
byte[] value, Put put){}
or
public boolean checkAndPut(KeyValue checkKv, Put put){}
for now?
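
For what it's worth, a hedged sketch of how the first proposed signature could
cover your create-only-once case (this assumes a null expected value would
mean "the cell must be empty"; that detail, like the signature itself, is
still just a proposal):

Put put = new Put(row);
put.add(family, qualifier, value);
// Only goes through if the "deleted" cell has no value yet
boolean created = table.checkAndPut(row, family, Bytes.toBytes("deleted"),
    null, put);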



 However, here's an idea. What if Put and Delete objects have a field
 condition (maybe, onlyIf would be a better name) which is exactly the
 map with columns and expected values. So, a given Put or Delete of an
 updates list will only happen if those expected values match.


Puts and deletes are pretty much just a List<KeyValue>, which is basically a
List<byte[]>.
I don't think that we want to add complexity to puts and deletes now that
we have worked
so hard to make them faster and more bare bones.


 Also, maybe it should be possible to indicate common expected values for
 all
 updates of a list too, so a client won't have to put in all updates the
 same
 values if needed. But we must remember to solve the conflicts of expected
 values.

Not really sure if you mean that we would check the value of a key before
inserting the new
value? That would mean that you would have to do a get for every put/delete
which is not
something we want in the general case.



 (By the way, I haven't seen the guts of new Puts and Deletes, so I don't
 know how difficult would it be to implement it -- but I can help, if
 necessary)

 Thanks,

 On Tue, Jun 2, 2009 at 2:34 PM, Erik Holstad erikhols...@gmail.com
 wrote:

  Hi!
  I'm working on putting checkAndSave back into 0.20 and just want to check
  with the people that are using it how they are using it
  so that I can make it as good as possible for these users.
 
  Since the API has changed from earlier versions there are some things
 that
  one need to think about.
  For now in the new API there are no Updates, just Put and Delete, so for
  now I need to know if users used to delete in the old batchUpdate
  or just put?
 
  The new return format Result might seem like a good way to send in the
 data
  to be used as actual, but there is no super easy way to build that
  on the client side for now, so would be good to know how you are doing
  this.
  If you do a get, save the result and then use it for the check or if you
  just create new structures on the client?
 
  Regards Erik
 



 --
 Guilherme

 msn: guigermog...@hotmail.com
 homepage: http://germoglio.googlepages.com


Regards Erik


Re: KeyValue and BatchOperation

2009-05-28 Thread Erik Holstad
Hey Sudipto!


On Thu, May 28, 2009 at 7:01 PM, Sudipto Das sudi...@cs.ucsb.edu wrote:

 Hi,

 I am using HBase for development of a system for my research project. I
 have
 a few questions regarding some recent API and Class changes in HBase which
 I
 suppose are to be released in HBase 0.20.

 * I saw that for internal operations of HBase a new class
 org.apache.hadoop.hbase.KeyValue has replaced the usage of a lot of other
 classes (for example HLogEdit and HStoreKey), but till now, there is no
 change in the client API which still sends updates via BatchOperation and
 BatchUpdate which is then converted to KeyValue. So I am a bit confused
 whether to use KeyValue or the present BatchOperation class for
 communicating with the RegionServers. I am not using the HBase Client API,
 so I am not limited by that, so I was wondering which class would be a
 better choice to be compatible with 0.20.


What you are seeing in trunk at the moment is an intermediate step towards
the new API. If you want to look at the most recent code for the big patch,
HBASE-1304, which includes the new API + most of the new server
implementation code, you can get that at
http://github.com/ryanobjc/hbase/tree/hbase-1304




 * What is a good way of determining the HeapSize (in order to implement the
 hbase.io.HeapSize interface) for newly added classes? I saw that
 hbase.io.HeapSize has a few new constants provide sizes of some of the
 common types, but most HBase internal classes use an assigned value of
 HEAP_TAX, and I could not figure out how the value was obtained.


HeapSize is determined by looking at the actual memory usage of the class
that implements it, in an attempt to count every
byte used for it. There are a couple of methods being used to figure out the
size, but we haven't set any standard way yet; that
will probably be done pretty soon.


 * For IPC, I could not pass a List of Writables since a List is not
 Writable. Is there any plan for adding a utility class (or is there already
 any such class available?) that can act as a List of Writables type and can
 be shipped across the network using IPC.


If I'm not mistaken, you can write an array of Writables.


 * The HLog right now is very rigid in terms of what it accepts as Log
 Entries. Is there something inherent in Hadoop IO that prevents the Logger
 from accepting any Writable as log edit, rather than the mandatory KeyValue
 (or HLogEdit in 0.19.x). This will make the logger flexible and reusable
 for
 other uses. Apparently, Hadoop IO just needs writables. Is there any catch
 in a generic type for the Log Key and Log Edit?

 * As noted in the Bigtable paper, a single logger thread can become a
 bottleneck for an update intensive workload with Write Ahead Logging. I was
 wondering if an advanced logger will be available in some newer version of
 HBase? Advanced as in multi-threaded logger supporting asynchronous appends
 serialized using a common log sequence number.

 Any comments and suggestions would be helpful.

 Thanks in advance.

 Regards
 Sudipto

 PhD Candidate
 CS @ UCSB
 Santa Barbara, CA 93106, USA
 http://www.cs.ucsb.edu/~sudipto



Regards Erik


Re: Urgent: HBase Data Lost

2009-05-25 Thread Erik Holstad
Hey Arber!
What it sounds like to me is that the META table hadn't been flushed to disk
and was only sitting in memory, so
when the machine went down that data got lost.

Regards Erik


Re: Obtaining the Timestamp of my last insert/update

2009-05-25 Thread Erik Holstad
Hi Bob!
What would be the use case where you could use an explicit timestamp instead
of using the rowlock?
To me they are used for different things, so what are you planning to do
with them?

Regards Erik


Re: Exception in rowcount program

2009-04-24 Thread Erik Holstad
Hi Aseem!
Just out of curiosity, did you follow the instructions on
http://wiki.apache.org/hadoop/Hbase/MapReduce
and they didn't work for you?
The reason that I'm asking is that maybe we need to change something or be
clearer.
Regards Erik


Re: Migration

2009-04-16 Thread Erik Holstad
Hi Rakhi!
I'm not exactly sure how the migration tool is going to look for 0.20, but the
whole disk storage format
is going to change, so I don't think that you will be able to just use
distcp.

Regards Erik


Re: Some HBase FAQ

2009-04-13 Thread Erik Holstad
On Mon, Apr 13, 2009 at 7:12 AM, Puri, Aseem aseem.p...@honeywell.com wrote:

 Hi

I am new HBase user. I have some doubts regards
 functionality of HBase. I am working on HBase, things are going fine but
 I am not clear how are things happening. Please help me by answering
 these questions.



 1.  I am inserting data in HBase table and all regions get balanced
 across various Regionservers. But what will happens when data increases
 and there is not enough space in Regionservers to accommodate all
 regions. So I will like this that some regions in Regionserver and some
 are at HDFS but not on Regionserver or HBase Regioservers stop taking
 new data?

Not really sure what you mean here, but if you are asking what to do when
you are
running out of disk space on the regionservers, the answer is add another
machine
or two.





 2.  When I insert data in HBase table, 3 to 4 mapfiles are generated
 for one category, but after some time all mapfiles combines as one file.
 Is this we call minor compaction actually?

When all current mapfiles and the memcache are combined into one file, this
is called a major compaction; see the BigTable paper for more details.





 3.  For my application where I will use HBase will have updates in a
 table frequently. Should is use some other database as a intermediate to
 store data temporarily like MySQL and then do bulk update on HBase or
 should I directly do updates on HBase. Please tell which technique will
 be more optimized in HBase?

HBase is fast for reads, which has so far been the main focus of the
development; with
0.20 we can hopefully add even faster random reading to it to make it a more
well rounded
system. Is HBase too slow for you today when writing to it, and what are your
requirements?

Regards Erik


Re: How to get all versions of the data in HBase

2009-04-11 Thread Erik Holstad
Hi Ideal!
It looks like the get call is the right call to make; I'm not really sure why
you are not getting more than 1 in return, you
should at least get 3 back since that is the default number of versions to
keep. I'm not sure if you changed this setting,
but when you create your HColumnDescriptor you set the number of versions to
keep.
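
A small sketch of the query side (0.20 client API; table, family and qualifier
names are placeholders): ask for more than one version explicitly on the Get:

Get get = new Get(Bytes.toBytes("myRow"));
get.addColumn(Bytes.toBytes("myFamily"), Bytes.toBytes("myQualifier"));
get.setMaxVersions(3);  // or setMaxVersions() to ask for all stored versions
Result result = table.get(get);
for (KeyValue kv : result.list()) {
  // one KeyValue per version, newest first
  System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
}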

Regards Erik


Re: using cascading fro map-reduce

2009-04-08 Thread Erik Holstad
Hi!
If you are interested in Cascading I recommend that you ask on the Cascading
mailing list or come ask in the IRC channel.
The mailing list can be found at the bottom left corner of www.cascading.org.

Regards Erik


Re: try to run PerformanceEvaluation and encounter RetriesExhaustedException

2009-04-06 Thread Erik Holstad
Hi Stuart!
We are still waiting for Hadoop 0.20 to be released, and after we have that,
maybe 2 weeks or so
to finalize our release and get most of the bugs fixed. I'm not really sure how
they are doing over at
Hadoop, but I would guess 4-6 weeks or so. Hopefully we will have a
functioning trunk before too long
with most of the 0.20 features working, but there are still some open
issues for 0.20; have a look at
https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hiderequestId=12313132

Regards Erik


Re: timestamp uses

2009-04-03 Thread Erik Holstad
Hi Genady!
If everything goes as planned there will be a possibility to pass a
TimeRange into every get query in 0.20, so that you will
be able to do the call: give me all data from row r, family f and column c
in the time range t1 to t2. The nice thing about the
new implementation is also that you will not have to go through all the
storefiles: once you get to the storefiles with older
data than t1 you can stop, so the query is also going to be faster than before.
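
In sketch form the call would then look roughly like this (0.20 client API,
with t1 as the lower bound and the upper bound exclusive):

Get get = new Get(Bytes.toBytes("r"));
get.addColumn(Bytes.toBytes("f"), Bytes.toBytes("c"));
get.setTimeRange(t1, t2);  // min inclusive, max exclusive
Result result = table.get(get);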

Regards Erik


Re: How is Hbase column oriented ?

2009-04-02 Thread Erik Holstad
Hi Hiro!

I attached a version of the current system layout; things will change at the
lowest level for 0.20,
but the other things should stay the same.

At the very bottom we are going to have something called an HFile, which is
basically a list of
things called KeyValues. A KeyValue is a byte[] including the key,
row/family/column/timestamp/type, and the value. What makes it column
oriented at this level
is that families are physically stored together. Depending on how big a row
is, you can have a different
number of rows covered in a storefile.

Hope this helps.

Regards Erik


Re: Novice Hbase user needs help with data upload - gets a RetriesExhaustedException, followed by NoServerForRegionException

2009-04-01 Thread Erik Holstad
Hi Ron!
you can try to look at:

http://wiki.apache.org/hadoop/Hbase/Troubleshooting#5 and 6

http://hadoop.apache.org/hbase/docs/r0.19.0/api/overview-summary.html#overview_description

Some similar problems can be found in:

http://www.nabble.com/RetriesExhaustedException--for-TableReduce-td22569113.html
http://www.nabble.com/RetriesExhaustedException!-td22408156.html

Hope that it can be of help
Regards Erik


Re: ANN: hbase-0.19.1 release candidate 1

2009-03-12 Thread Erik Holstad
Hey Stack!
Downloaded and tried to use it with Cascading, but it seems like HBASE-1240
and HBASE-1221 didn't
quite make it, or is it just me?

Regards Erik




Re: MapReduce job to update HBase table in-place

2009-02-25 Thread Erik Holstad
Hi Stuart!
According to our tests, inputting your data using a
BatchUpdate
is faster than using the collector, but those tests were done a while ago, so
if you find
something else please let us know.

Erik


Re: Problems getting Eclipse Hadoop plugin to work.

2009-02-20 Thread Erik Holstad
Hi guys!
Thanks for your help, but still no luck. I did try to set it up on a
different machine with Eclipse 3.2.2 and the
IBM plugin instead of the Hadoop one; with that one I only needed to fill out
the install directory and the host,
and that worked just fine.
I have filled out the ports correctly and the cluster is up and running and
works just fine.

Regards Erik


Re: Problems getting Eclipse Hadoop plugin to work.

2009-02-19 Thread Erik Holstad
Thanks guys!
Running Linux and the remote cluster is also Linux.
I have the properties set up like that already on my remote cluster, but
not sure where to input this info into Eclipse.
And when changing the ports to 9000 and 9001 I get:

Error: java.io.IOException: Unknown protocol to job tracker:
org.apache.hadoop.dfs.ClientProtocol

Regards Erik


Re: Map/Recuce Job done locally?

2009-02-19 Thread Erik Holstad
Hey Philipp!
MR jobs are run locally if you just run the java file; to get them running in
distributed mode
you need to create a job jar and run that like ./bin/hadoop jar ...

Regards Erik


Re: Map/Recuce Job done locally?

2009-02-19 Thread Erik Holstad
Hey Philipp!
Not sure about your time tracking thing, it probably works; I've just used a
bash script
to start the jar, and then you can do the timing in the script.
About how to compile the jars: you need to include the dependencies too, but
you will see what you are missing when you run the job.

Regards Erik


Problems getting Eclipse Hadoop plugin to work.

2009-02-18 Thread Erik Holstad
I'm using Eclipse 3.3.2 and want to view my remote cluster using the Hadoop
plugin.
Everything shows up and I can see the map/reduce perspective but when trying
to
connect to a location I get:
Error: Call failed on local exception

I've set the host to for example xx0, where xx0 is a remote machine
accessible from
the terminal, and the ports to 50020/50040 for M/R master and
DFS master respectively. Is there anything I'm missing to set for remote
access to the
Hadoop cluster?

Regards Erik


Re: backup tables using ImportMR / ExportMR ( HBASE-974 )

2009-02-11 Thread Erik Holstad
Hey Yair!
Ran a test Export and a test Import this morning and, except for the fact
that they were not the
fastest on the planet :), they worked just fine. The only thing that I needed
to change was to remove
the dependency on the HBaseRef from the makeImportJar.sh.

Not really sure what you mean by the reducer class not being set in the job
conf? Calling the method
TableMapReduceUtil.initTableReduceJob(outputTable, MyReducer.class, c) does
just that, no?
Or do you not want to use that method and just set it yourself?
Or are you talking about the difference, in the Importer now, between
setting up the input and the output?
Regards Erik


Re: backup tables using ImportMR / ExportMR ( HBASE-974 )

2009-02-10 Thread Erik Holstad
Hey Yair!

So you are saying that you don't think that the problem is in the importer
but in your cluster
setup?

Thanks for finding all these small things, like the @Override for
example.
I haven't used the code in a little while, but I will get my hands dirty
tomorrow morning, so we
can figure this out and get it working for you; my test cluster is down at
the moment but will
hopefully be up tomorrow :)

So, maybe you can come hang out on the IRC and we will try to get this going
tomorrow?

Regards Erik


Re: backup tables using ImportMR / ExportMR ( HBASE-974 )

2009-02-06 Thread Erik Holstad
Hey Yair!
Sorry about that, HBaseRef is not needed for the import. I deleted the
makeJar file, removed the code,
and uploaded a new version. So you can just remove it in your code or
download the new version.

If you have any more questions, please let me know.

Regards Erik


Re: backup tables using ImportMR / ExportMR ( HBASE-974 )

2009-02-06 Thread Erik Holstad
Hey Yair!
Answers in line.


 1) I had to replace the new Configuration() to new
 HBaseConfiguration() in the java source or the Export didn't work
 properly.

This is probably due to the fact that the api was changed after I used it
last.




 2) I had to add hadoop jar and hbase jar to the classpath in the
 make.jar.sh or they wouldn't compile

We have these settings globally so I didn't think about it.

If you can post your updates on these files that would be great, or if you
send them
to me, I will put them up.




 3) When running the ImportMR.sh, I always get the following error after
 100% map and 40% or 66% reduce. Please let me know if you are familiar
 with the problem
 Thanks
 -Yair

 09/02/06 15:57:52 INFO mapred.JobClient:  map 100% reduce 66%
 09/02/06 16:00:47 INFO mapred.JobClient:  map 100% reduce 53%
 09/02/06 16:00:47 INFO mapred.JobClient: Task Id :
 attempt_200902061529_0007_r_00_0, Status : FAILED
 org.apache.hadoop.hbase.MasterNotRunningException: localhost:6
at
 org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getMaster
 (HConnectionManager.java:236)
at
 org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateReg
 ion(HConnectionManager.java:422)
at org.apache.hadoop.hbase.client.HTable.init(HTable.java:114)
at org.apache.hadoop.hbase.client.HTable.init(HTable.java:74)
at ImportMR$MyReducer.reduce(ImportMR.java:138)
at ImportMR$MyReducer.reduce(ImportMR.java:128)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

 attempt_200902061529_0007_r_00_0: Exception in thread Timer thread
 for monitoring mapred java.lang.NullPointerException
 attempt_200902061529_0007_r_00_0:   at
 org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(GangliaConte
 xt.java:195)
 attempt_200902061529_0007_r_00_0:   at
 org.apache.hadoop.metrics.ganglia.GangliaContext.emitMetric(GangliaConte
 xt.java:138)
 attempt_200902061529_0007_r_00_0:   at
 org.apache.hadoop.metrics.ganglia.GangliaContext.emitRecord(GangliaConte
 xt.java:123)
 attempt_200902061529_0007_r_00_0:   at
 org.apache.hadoop.metrics.spi.AbstractMetricsContext.emitRecords(Abstrac
 tMetricsContext.java:304)
 attempt_200902061529_0007_r_00_0:   at
 org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(Abstract
 MetricsContext.java:290)
 attempt_200902061529_0007_r_00_0:   at
 org.apache.hadoop.metrics.spi.AbstractMetricsContext.access$000(Abstract
 MetricsContext.java:50)
 attempt_200902061529_0007_r_00_0:   at
 org.apache.hadoop.metrics.spi.AbstractMetricsContext$1.run(AbstractMetri
 csContext.java:249)
 attempt_200902061529_0007_r_00_0:   at
 java.util.TimerThread.mainLoop(Timer.java:512)
 attempt_200902061529_0007_r_00_0:   at
 java.util.TimerThread.run(Timer.java:462)
 09/02/06 16:00:48 INFO mapred.JobClient:  map 100% reduce 13%
 09/02/06 16:00:48 INFO mapred.JobClient: Task Id :
 attempt_200902061529_0007_r_02_0, Status : FAILED
 org.apache.hadoop.hbase.MasterNotRunningException: localhost:6
at
 org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getMaster
 (HConnectionManager.java:236)
at
 org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateReg
 ion(HConnectionManager.java:422)
at org.apache.hadoop.hbase.client.HTable.init(HTable.java:114)
at org.apache.hadoop.hbase.client.HTable.init(HTable.java:74)
at ImportMR$MyReducer.reduce(ImportMR.java:138)
at ImportMR$MyReducer.reduce(ImportMR.java:128)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

 attempt_200902061529_0007_r_02_0: Exception in thread Timer thread
 for monitoring mapred java.lang.NullPointerException
 attempt_200902061529_0007_r_02_0:   at
 org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(GangliaConte
 xt.java:195)
 attempt_200902061529_0007_r_02_0:   at
 org.apache.hadoop.metrics.ganglia.GangliaContext.emitMetric(GangliaConte
 xt.java:138)
 attempt_200902061529_0007_r_02_0:   at
 org.apache.hadoop.metrics.ganglia.GangliaContext.emitRecord(GangliaConte
 xt.java:123)
 attempt_200902061529_0007_r_02_0:   at
 org.apache.hadoop.metrics.spi.AbstractMetricsContext.emitRecords(Abstrac
 tMetricsContext.java:304)
 attempt_200902061529_0007_r_02_0:   at
 org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(Abstract
 MetricsContext.java:290)
 attempt_200902061529_0007_r_02_0:   at
 org.apache.hadoop.metrics.spi.AbstractMetricsContext.access$000(Abstract
 MetricsContext.java:50)
 attempt_200902061529_0007_r_02_0:   at
 org.apache.hadoop.metrics.spi.AbstractMetricsContext$1.run(AbstractMetri
 csContext.java:249)
 attempt_200902061529_0007_r_02_0:   at
 java.util.TimerThread.mainLoop(Timer.java:512)
 attempt_200902061529_0007_r_02_0:   at
 

Re: Another Question on Backup Tables

2009-01-23 Thread Erik Holstad
Michael!
The code for 974 should be up now and ready for you to test.
Please let us know if you have any problems.

Erik

On Fri, Jan 23, 2009 at 10:51 AM, stack st...@duboce.net wrote:

 You'd also need to copy out the tables' entries in .META.

 There are some scripts in HBASE-643 for copying and renaming tables that
 might serve as starting point copying a table.

 St.Ack


 Ryan Rawson wrote:

 You'd have to get at least:

 /hbase/$TABLE_NAME
 and
 /hbase/log*

 Not too sure how hbase would handle that though...  The logs contain
 roll-forward info for every table.  So you'd have entries for tables that
 don't exist.

 -ryan

 On Fri, Jan 23, 2009 at 2:05 AM, michael.dag...@gmail.com wrote:



 Hi, all

 As I understand from the answers of Jean-Daniel and Stack
 we can backup the database just by copying the HDFS folder
 but what if I want to backup only a few tables ?

 I guess I can scan the tables and copy the scanned data
 to somewhere on the backup storage. Are there other solutions ?

 Thank you for your cooperation,
 M.










Re: How to read a subset of records based on a column value in a M/R job?

2008-12-18 Thread Erik Holstad
Hi Tigertail!
Not sure if I understand your original problem correctly, but it seemed to me
that you just wanted to get the rows with the value 1 in a column, right?

Did you try to only put that column in there for the rows that you want to
get and use that as an input
to the MR?

I haven't timed my MR jobs with this approach so I'm not sure how it is
handled internally, but maybe
it is worth giving it a try.

Regards Erik

On Wed, Dec 17, 2008 at 8:37 PM, tigertail tyc...@yahoo.com wrote:


 Hi St. Ack,

 Thanks for your input. I ran 32 map tasks (I have 8 boxes with each 4
 CPUs).
 Suppose the 1M row keys are known beforehand and saved in an file, I just
 read each key into a mapper and use table.getRow(key) to get the record. I
 also tried to increase the # of map tasks, but it did not improve the
 performance. Actually, even worse. Many tasks are failed/killed with sth
 like no response in 600 seconds.


 stack-3 wrote:
 
  For A2. below, how many map tasks?  How did you split the 1M you wanted
  to fetch? How many of them ran concurrently?
  St.Ack
 
 
  tigertail wrote:
  Hi, can anybody help? Hopefully the following can be helpful to make my
  question clear if it was not in my last post.
 
  A1. I created a table in HBase and then I inserted 10 million records
  into
  the table.
  A2. I ran a M/R program with totally 10 million get by rowkey
 operation
  to
  read the 10M records out and it took about 3 hours to finish.
  A3. I also ran a M/R program which used TableMap to read the 10M records
  out
  and it just took 12 minutes.
 
  Now suppose I only need to read 1 million records whose row keys are
  known
  beforehand (and let's suppose the worst case the 1M records are evenly
  distributed in the 10M records).
 
  S1. I can use 1M get by rowkey. But it is slow.
  S2. I can also simply use TableMap and only output the 10M records in
 the
  map function but it actually read the whole table.
 
  Q1. Is there some more efficient way to read the 1M records, WITHOUT
  PASSING
  THOUGH THE WHOLE TABLE?
 
  How about if I have 1 billion records in an HBase table and I only need
  to
  read 1 million records in the following two scenarios.
 
  Q2. suppose their row keys are known beforehand
  Q3. or suppose these 1 million records have the same value on a column
 
  Any input would be greatly appreciated. Thank you so much!
 
 
  tigertail wrote:
 
  For example, I have a HBase table with 1 billion records. Each record
  has
  a column named 'f1:testcol'. And I want to only get the records with
  'f1:testcol'=0 as the input to my map function. Suppose there are 1
  million such records, I would expect this would be must faster than I
  get
  all 1 billion records into my map function and then do condition check.
 
  By searching on this board and HBase documents, I tried to implement my
  own subclass of TableInputFormat and set a ColumnValueFilter in
  configure
  method.
 
  public class TableInputFilterFormat extends TableInputFormat implements
  JobConfigurable {
    private final Log LOG = LogFactory.getLog(TableInputFilterFormat.class);

    public static final String FILTER_LIST = "hbase.mapred.tablefilters";

    public void configure(JobConf job) {
      Path[] tableNames = FileInputFormat.getInputPaths(job);

      // Restrict the scan to the requested input columns.
      String colArg = job.get(COLUMN_LIST);
      String[] colNames = colArg.split(" ");
      byte[][] m_cols = new byte[colNames.length][];
      for (int i = 0; i < m_cols.length; i++) {
        m_cols[i] = Bytes.toBytes(colNames[i]);
      }
      setInputColumns(m_cols);

      // Only rows where f1:testcol equals "0" should reach the mappers.
      ColumnValueFilter filter = new ColumnValueFilter(
          Bytes.toBytes("f1:testcol"),
          ColumnValueFilter.CompareOp.EQUAL,
          Bytes.toBytes("0"));
      setRowFilter(filter);

      try {
        setHTable(new HTable(new HBaseConfiguration(job),
            tableNames[0].getName()));
      } catch (Exception e) {
        LOG.error(e);
      }
    }
  }
 
  However, The M/R job with RowFilter is much slower than the M/R job w/o
  RowFilter. During the process many tasked are failed with sth like
 Task
  attempt_200812091733_0063_m_19_1 failed to report status for 604
  seconds. Killing!. I am wondering if RowFilter can really decrease the
  record feeding from 1 billion to 1 million? If it cannot, is there any
  other method to address this issue?
 
  I am using Hadoop 0.18.2 and HBase 0.18.1.
 
  Thank you so much in advance!
 
 
 
 
 
 
 
 

 --
 View this message in context:
 http://www.nabble.com/How-to-read-a-subset-of-records-based-on-a-column-value-in-a-M-R-job--tp20963771p21066895.html
 Sent from the HBase User mailing list archive at Nabble.com.




Re: Row with many-many columns

2008-12-18 Thread Erik Holstad
Hi!
I'm not totally sure about this, but I think that one family is stored in one
HStore, which consists of multiple HStoreFiles, which in their turn consist of
MapFiles and an index file.
Regards Erik

On Wed, Dec 17, 2008 at 8:52 AM, Slava Gorelik slava.gore...@gmail.com wrote:

 Hi.

 I think it should be faster, for the reason that each column family is a
 separate map file (correct me if I'm wrong). It means that when you ask for a
 specific column family, HBase will not open the other map files.
 Btw, the functionality to get particular column family will be released in
 0.19 : https://issues.apache.org/jira/browse/HBASE-857


 Best Regards.


 On Wed, Dec 17, 2008 at 5:59 PM, Michael Dagaev michael.dag...@gmail.com
 wrote:

  Hi, all
 
 Let there is a row with A, B, and C column families. Let C column
  family many-many columns (qualifiers). As I understand, retrieve of
  such a row is slow. What if  I retrieve only A and B columns but not C
  ? I guess it will be much faster. Is it correct?
 
  Thank you for your cooperation,
  M.
 


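For illustration, a rough, untested sketch of fetching only the A and B
families, assuming the getRow overload that takes a column list (the HBASE-857
functionality Slava mentions); the table, row, and family names are made up:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class GetTwoFamilies {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");
    // Ask only for families A and B; the wide C family is never requested.
    byte[][] columns = { Bytes.toBytes("A:"), Bytes.toBytes("B:") };
    RowResult row = table.getRow(Bytes.toBytes("some-row"), columns);
    System.out.println(row);
  }
}
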

Re: How to read a subset of records based on a column value in a M/R job?

2008-12-18 Thread Erik Holstad
Hi Tigertail!
I have written some MR jobs earlier, but nothing fancy like implementing my
own filter like you did. What I do know is that you can specify the columns
that you want to read as the input to the map task. But since I'm not sure how
that filter process is handled internally, I can't say whether it reads in all
the columns and then filters them out, or how it actually does it; please let
me know how it works, you people out there that have this knowledge :).

But you could try to have a column family age: and then have one column for
every age that you want to be able to specify, for example age:30 or
something, so you don't have to look at the value of the column, but rather
use the column itself as the key.
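
For illustration, a rough, untested sketch of that idea using the old client
API; the table, row, and column names are made up:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.util.Bytes;

public class AgeMarkerWrite {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "people");

    BatchUpdate update = new BatchUpdate(Bytes.toBytes("some-row-key"));
    update.put(Bytes.toBytes("info:name"), Bytes.toBytes("Alice"));
    // Empty marker cell: the qualifier (30) carries the value itself, so a
    // job that only needs people aged 30 can take "age:30" as its input
    // column instead of filtering on a cell value.
    update.put(Bytes.toBytes("age:30"), new byte[0]);
    table.commit(update);
  }
}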

Hope that helped you a little bit, and please let me know what kind of results
you come up with.

Regards Erik

On Thu, Dec 18, 2008 at 9:26 AM, tigertail tyc...@yahoo.com wrote:


 Thanks Erik,

 What I want is either by row key values, or by a specific value in a
 column,
 to quickly return a small subset without reading all records into Mapper.
 So
 I actually have two questions :)

 For the column-based search, for example, I have 1 billion people records
 in
 the table, the row key is the name, and there is an age column. Now I
 want to find the records with age=30. How can I avoid to read every record
 into mapper and then filter the output?

 For searching by row key values, let's suppose I have 1 million people's
 names. Is there a more efficient way than running 1 million times
 table.getRow(name), in case the name strings are randomly distributed
 (and
 hence it is useless to write a new getSplits)?

  Did you try to only put that column in there for the rows that you want
  to
  get and use that as an input
  to the MR?

 I am not sure I get you there. I can use TableInputFormatBase.setInputColumns
 in my program to only return the 'age' column, but still, I need to read
 every row from the table into the mapper. Or is my understanding wrong? Can
 you give more details on your thought?

 Thanks again.



 Erik Holstad wrote:
 
  Hi Tigertail!
  Not sure if I understand you original problem correct, but it seemed to
 me
  that you wanted to just get
  the rows with the value 1 in a column, right?
 
  Did you try to only put that column in there for the rows that you want
 to
  get and use that as an input
  to the MR?
 
  I haven't timed my MR jobs with this approach so I'm not sure how it is
  handled internally, but maybe
  it is worth giving it a try.
 
  Regards Erik
 
  On Wed, Dec 17, 2008 at 8:37 PM, tigertail tyc...@yahoo.com wrote:
 
 
  Hi St. Ack,
 
  Thanks for your input. I ran 32 map tasks (I have 8 boxes with each 4
  CPUs).
  Suppose the 1M row keys are known beforehand and saved in an file, I
 just
  read each key into a mapper and use table.getRow(key) to get the record.
  I
  also tried to increase the # of map tasks, but it did not improve the
  performance. Actually, even worse. Many tasks are failed/killed with sth
  like no response in 600 seconds.
 
 
  stack-3 wrote:
  
   For A2. below, how many map tasks?  How did you split the 1M you
 wanted
   to fetch? How many of them ran concurrently?
   St.Ack
  
  
   tigertail wrote:
   Hi, can anybody help? Hopefully the following can be helpful to make
  my
   question clear if it was not in my last post.
  
   A1. I created a table in HBase and then I inserted 10 million records
   into
   the table.
   A2. I ran a M/R program with totally 10 million get by rowkey
  operation
   to
   read the 10M records out and it took about 3 hours to finish.
   A3. I also ran a M/R program which used TableMap to read the 10M
  records
   out
   and it just took 12 minutes.
  
   Now suppose I only need to read 1 million records whose row keys are
   known
   beforehand (and let's suppose the worst case the 1M records are
 evenly
   distributed in the 10M records).
  
   S1. I can use 1M get by rowkey. But it is slow.
   S2. I can also simply use TableMap and only output the 10M records in
  the
   map function but it actually read the whole table.
  
   Q1. Is there some more efficient way to read the 1M records, WITHOUT
   PASSING
   THOUGH THE WHOLE TABLE?
  
   How about if I have 1 billion records in an HBase table and I only
  need
   to
   read 1 million records in the following two scenarios.
  
   Q2. suppose their row keys are known beforehand
   Q3. or suppose these 1 million records have the same value on a
 column
  
   Any input would be greatly appreciated. Thank you so much!
  
  
   tigertail wrote:
  
   For example, I have a HBase table with 1 billion records. Each
 record
   has
   a column named 'f1:testcol'. And I want to only get the records with
   'f1:testcol'=0 as the input to my map function. Suppose there are 1
   million such records, I would expect this would be must faster than
 I
   get
   all 1 billion records into my map function and then do condition
  check.
  
   By searching on this board

Redirecting the logs to remote log server?

2008-11-21 Thread Erik Holstad
Hi!
I have been trying to redirect the logs from Hadoop to a remote log server.
I tried adding a socket appender to the log4j.properties file in the conf
directory, and also adding the commons-logging + log4j jars and the same
log4j.properties file into the WEB-INF of the master, but I still get nothing
in the logs on the log server. What is it that I'm missing here?

Regards Erik
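
For reference, a minimal log4j.properties sketch of the kind of socket
appender setup being attempted here; the host, port, and level are
placeholders, and the SOCKET appender would normally be added alongside the
appenders already defined in Hadoop's conf/log4j.properties:

log4j.rootLogger=INFO,SOCKET
log4j.appender.SOCKET=org.apache.log4j.net.SocketAppender
log4j.appender.SOCKET.RemoteHost=loghost.example.com
log4j.appender.SOCKET.Port=4560
log4j.appender.SOCKET.ReconnectionDelay=10000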


Re: Cleaning up files in HDFS?

2008-11-17 Thread Erik Holstad
Hi!
I thought that the trash function only applies to files that have already been
deleted, not to files that are yet to be deleted, but it would be nice if you
could set it up to work on a specific directory.

Erik

On Fri, Nov 14, 2008 at 6:07 PM, lohit [EMAIL PROTECTED] wrote:

 Have you tried fs.trash.interval

 property
  namefs.trash.interval/name
  value0/value
  descriptionNumber of minutes between trash checkpoints.
  If zero, the trash feature is disabled.
  /description
 /property

 more info about trash feature here.
 http://hadoop.apache.org/core/docs/current/hdfs_design.html


 Thanks,
 Lohit

 - Original Message 
 From: Erik Holstad [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Friday, November 14, 2008 5:08:03 PM
 Subject: Cleaning up files in HDFS?

 Hi!
 We would like to run a delete script that deletes all files older than
 x days that are stored in lib l in hdfs, what is the best way of doing
 that?

 Regards Erik




Cleaning up files in HDFS?

2008-11-14 Thread Erik Holstad
Hi!
We would like to run a delete script that deletes all files older than
x days that are stored in lib l in HDFS. What is the best way of doing that?

Regards Erik
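
For illustration, a rough, untested sketch of one way to do this with the
FileSystem API; the directory and the age threshold are passed in as made-up
command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteOldFiles {
  public static void main(String[] args) throws Exception {
    Path dir = new Path(args[0]);          // directory to clean
    int days = Integer.parseInt(args[1]);  // delete anything older than this
    long cutoff = System.currentTimeMillis() - days * 24L * 60 * 60 * 1000;

    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus[] files = fs.listStatus(dir);
    if (files == null) {
      return;  // directory does not exist
    }
    for (FileStatus status : files) {
      if (status.getModificationTime() < cutoff) {
        fs.delete(status.getPath(), true);  // recursive delete
      }
    }
  }
}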


Re: Passing Constants from One Job to the Next

2008-10-30 Thread Erik Holstad
Hi!
Is there a way of using the value read in the configure() in the Map or
Reduce phase?

Erik

On Thu, Oct 23, 2008 at 2:40 AM, Aaron Kimball [EMAIL PROTECTED] wrote:

 See Configuration.setInt() in the API. (JobConf inherits from
 Configuration). You can read it back in the configure() method of your
 mappers/reducers
 - Aaron

 On Wed, Oct 22, 2008 at 3:03 PM, Yih Sun Khoo [EMAIL PROTECTED] wrote:

  Are you saying that I can pass, say, a single integer constant with
 either
  of these three: JobConf? A HDFS file? DistributedCache?
  Or are you asking if I can pass given the context of: JobConf? A HDFS
 file?
  DistributedCache?
  I'm thinking of how to pass a single int from one JobConf to the next
 
  On Wed, Oct 22, 2008 at 2:57 PM, Arun C Murthy [EMAIL PROTECTED]
 wrote:
 
  
   On Oct 22, 2008, at 2:52 PM, Yih Sun Khoo wrote:
  
I like to hear some good ways of passing constants from one job to the
   next.
  
  
   Unless I'm missing something: JobConf? A HDFS file? DistributedCache?
  
   Arun
  
  
  
   These are some ways that I can think of:
   1)  The obvious solution is to carry the constant as part of your
 value
   from
   one job to the next, but that would mean every value would hold that
   constant
   2)  Use the reporter as a hack so that you can set the status message
  and
   then get the status message back when u need the constant
  
   Any other ideas?  (Also please do not include code)
  
  
  
 


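For illustration, a minimal, untested sketch of the JobConf approach Aaron
describes, showing how the value read in configure() can be kept in a field
and used later in map(); the key name and the types are made up:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ConstantMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private int myConstant;

  public void configure(JobConf job) {
    // Set by the driver with job.setInt("my.job.constant", 42);
    myConstant = job.getInt("my.job.constant", 0);
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    // The constant is available here without carrying it in every value.
    out.collect(value, new IntWritable(myConstant));
  }
}
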

Re: How to get all columns from the scanner in a Map-Reduce job?

2008-10-20 Thread Erik Holstad
Tried it and it didn't work, but then I realized that it doesn't
work for scanners either, so I refiled the issue to client/944 instead.

Regards Erik


On Mon, Oct 20, 2008 at 11:13 AM, Erik Holstad [EMAIL PROTECTED] wrote:

 Hi Stack!
 Will try that fix, opened up a Jira-941 in the meantime.

 Regards Erik





 On Sun, Oct 19, 2008 at 4:05 PM, Michael Stack [EMAIL PROTECTED] wrote:

 What happens if you pass a column name of "^.*$"?  Will it return all
 columns?  I don't think it will.  IIRC the regex can only be applied to the
 column qualifier portion of the column name, which means you'd have to write
 out a column spec. for your mapreduce job per column family.  So, if you had
 three families but each had a thousand columns, if you write a column
 specification of family1:.* family2:.* family3:.*, that should return them
 all.

 I took a quick look.  It should be the case that an empty string returns
 all columns of a row but currently at least, it'll fail on line #75 in
 TableInputFormat:

   if (colArg == null || colArg.length() == 0) {

 Try removing the colArg.length().  Maybe it'll work then? (You'll pass in
 an array of columns of zero-length -- I think that'll work).

 Meantime, open a JIRA Eric.  Seems like a basic expectation, that there be
 a way to get all columns in an MR.

 St.Ack


 Erik Holstad wrote:

 Hey!
 Yes I did find that line in HAbstractScanner.java but not really sure
  how to use it to do what I want to do.

 Regards Erik

 On Sun, Oct 19, 2008 at 7:43 AM, Jean-Daniel Cryans [EMAIL PROTECTED]
 wrote:



 I think you are looking for this :

 // Pattern to determine if a column key is a regex
  static Pattern isRegexPattern =
   Pattern.compile("^.*[+|^*$\\[\\]\\}{)(]+.*$");

 J-D

 On Fri, Oct 17, 2008 at 9:39 PM, Erik Holstad [EMAIL PROTECTED]
 wrote:



 Hi!
 I'm trying to figure out how to get all the columns in a Map-Reduce job
 without having to specify
 them all?

 Found the line:
 @see org.apache.hadoop.hbase.regionserver.HAbstractScanner for column


 name


  *  wildcards

 in TableInputFormat.java but didn't find any help over in the
 HAbScanner.

 Regards Erik










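For illustration, a rough sketch of the per-family column specification St.Ack
suggests, assuming the public COLUMN_LIST key on TableInputFormat; the family
names are made up, and whether the regex form really matches depends on the
HAbstractScanner pattern above, so treat this as untested:

import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class AllColumnsJobSetup {
  public static void configure(JobConf job) {
    // One entry per column family of the table being scanned.
    job.set(TableInputFormat.COLUMN_LIST, "family1:.* family2:.* family3:.*");
  }
}
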

Writing a RowResult to HTable?

2008-10-19 Thread Erik Holstad
Hi!
Is there a good way to write a RowResult to a HTable without having to loop
through it to make it into a BatchUpdate?

Regards Erik
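
For reference, the explicit conversion loop being avoided looks roughly like
this, assuming the old RowResult/Cell/BatchUpdate API (an untested sketch):

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.RowResult;

public class RowResultWriter {
  // Copy every cell of a RowResult into a BatchUpdate and commit it.
  public static void write(HTable table, RowResult row) throws IOException {
    BatchUpdate update = new BatchUpdate(row.getRow());
    for (Map.Entry<byte[], Cell> entry : row.entrySet()) {
      update.put(entry.getKey(), entry.getValue().getValue());
    }
    table.commit(update);
  }
}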

