Efficient mass deletes

2010-04-01 Thread Juhani Connolly
Having an issue with table design regarding how to delete old/obsolete data. My rows are named in a non-time-sorted manner, id first followed by timestamp, the main objective being running big scans on specific IDs from time x to time y. However, this data builds up at a respectable rate and I
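As an illustration of the scan-then-delete approach for this kind of key layout, here is a minimal sketch against the 0.20-era client API. It assumes a hypothetical table named "events", row keys of the form <id bytes><8-byte big-endian timestamp>, and a client version that supports batched deletes; if expiry is purely time-based, setting a TTL on the column family is usually cheaper than deleting row by row.

```java
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MassDelete {
  // Delete all rows for one id between tsFrom (inclusive) and tsTo (exclusive),
  // assuming row keys are <id bytes><8-byte big-endian timestamp>.
  public static void deleteRange(String id, long tsFrom, long tsTo) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "events"); // hypothetical table name
    byte[] start = Bytes.add(Bytes.toBytes(id), Bytes.toBytes(tsFrom));
    byte[] stop = Bytes.add(Bytes.toBytes(id), Bytes.toBytes(tsTo));
    Scan scan = new Scan(start, stop);
    ResultScanner scanner = table.getScanner(scan);
    ArrayList<Delete> batch = new ArrayList<Delete>();
    try {
      for (Result r : scanner) {
        batch.add(new Delete(r.getRow()));
        if (batch.size() >= 1000) {   // send deletes in chunks rather than one RPC per row
          table.delete(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        table.delete(batch);
      }
    } finally {
      scanner.close();
    }
  }
}
```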

RE: How to do "Group By" in HBase

2010-04-01 Thread Sean
> From: jg...@facebook.com > To: hbase-user@hadoop.apache.org > Date: Thu, 1 Apr 2010 09:34:17 -0700 > Subject: RE: How to do "Group By" in HBase > > For 1/2, it seems that your row key design is ideal for those queries. You say it's inefficient because you need to scan the "whole session o

Re: Data size

2010-04-01 Thread Ryan Rawson
My general thought about prefix compression is that using LZO can help blunt the worst size issues. LZO can do 4x compression easily on our production dataset, so there hasn't been much effort at integrating prefix compression. I would be willing to help someone who wanted to implement prefix compression in con

Re: Data size

2010-04-01 Thread Ryan Rawson
Matt is correct in saying that block-cached data is stored as-is, not in compressed form. It might be possible to keep the blocks in RAM compressed, but it would be a bit of a challenge to figure out how to decompress as we scan, or at least to do so efficiently. Prefix compression should be fa

Re: Data size

2010-04-01 Thread Matt Corgan
Again - thanks for the feedback. I agree it's easy enough to use a 1-byte ColumnFamily name, which makes that overhead negligible. And it just occurred to me that prefix compression could solve the repeated-key problem. So, like you said in your first response, prefix compression may be the only logical ne

RE: Data size

2010-04-01 Thread Jonathan Gray
Matt, Make your families a single character. You get almost all of the space savings of not duplicating them, without any HBase changes. As for row keys, since they will be duplicated within each block, even standard LZO compression (not prefix compression) should do a decent job. You could see 2-3X comp
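A hedged sketch of combining both suggestions at table-creation time, using the 0.20-era admin API; the table name "metrics" and family "d" are made up for illustration, and LZO requires the codec to be installed on every node in the cluster.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateCompactTable {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("metrics");              // hypothetical table name
    HColumnDescriptor family = new HColumnDescriptor(Bytes.toBytes("d")); // single-character family name
    family.setCompressionType(Compression.Algorithm.LZO);                 // needs the LZO codec installed
    desc.addFamily(family);

    admin.createTable(desc);
  }
}
```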

Re: Using SPARQL against HBase

2010-04-01 Thread Jürgen Jakobitsch
hi again, i'm definitely interested. you probably heard of the heart project, but there's hardly anything going on there, so i think it's well worth the effort. for your discussion days i'd recommend taking a look at the openrdf sail api @ http://www.openrdf.org/doc/sesame2/system/ the point is that ther

Re: Data size

2010-04-01 Thread Matt Corgan
Jonathan - thanks for the detailed answer. I'm sure implementing this stuff is a nightmare when trying to minimize object instantiations. But, since you mentioned it had been discussed before, here's a concrete example to throw some support behind non-duplication and prefix compression in future re
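Matt's concrete example is cut off in this preview. For context, the duplication being discussed comes from HBase writing the full row key, family, and qualifier into every KeyValue. A rough, illustrative estimate of that per-cell overhead (hypothetical sizes, not Matt's numbers):

```java
public class KeyValueSizeEstimate {
  // Approximate serialized size of one KeyValue in an HFile block:
  // 4B key length + 4B value length + 2B row length + row
  // + 1B family length + family + qualifier + 8B timestamp + 1B type + value.
  static long kvSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
    return 4 + 4 + 2 + rowLen + 1 + familyLen + qualifierLen + 8 + 1 + valueLen;
  }

  public static void main(String[] args) {
    // Hypothetical cell: 30-byte row key, 10-byte qualifier, 8-byte value.
    long longFamily = kvSize(30, 10, 10, 8);  // 10-byte family name
    long shortFamily = kvSize(30, 1, 10, 8);  // 1-byte family name
    System.out.println("per-cell bytes, 10-byte family: " + longFamily);
    System.out.println("per-cell bytes, 1-byte family:  " + shortFamily);
    // Every cell in a row repeats the row key and family; that repetition is
    // what shorter names (and, eventually, prefix compression) would reduce.
  }
}
```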

Re: Using SPARQL against HBase

2010-04-01 Thread victor.hong
I am very interested in participating. Victor On 4/1/10 5:45 PM, "ext Amandeep Khurana" wrote: Andrew and I just had a chat about exploring how we can leverage HBase for a scalable RDF store, and we'll be looking at it in more detail over the next few days. Are any of you interested in helpin

Re: Using SPARQL against HBase

2010-04-01 Thread Amandeep Khurana
Andrew and I just had a chat about exploring how we can leverage HBase for a scalable RDF store, and we'll be looking at it in more detail over the next few days. Are any of you interested in helping out? We are going to be looking at what is required to build a triple store + query engine on

Re: DFSClient errors during massive HBase load

2010-04-01 Thread Andrew Purtell
First, "ulimit: 1024". That's fatal. You need to raise the file descriptor limit to something like 32K. See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6. From there, let's see. - Andy > From: Oded Rosen > Subject: DFSClient errors during massive HBase load > To: hbase-user@hadoop.a

Re: DFSClient errors during massive HBase load

2010-04-01 Thread Ryan Rawson
Hey, looks like DFS errors. The "bad firstAck" is a sign of a datanode problem. Perhaps your ulimit is too low: a 2048 xciever count but only 1024 sockets? Sounds suspicious to me. On Thu, Apr 1, 2010 at 1:19 PM, Oded Rosen wrote: > Hi all, > > I have a problem with a massive HBase loading j

DFSClient errors during massive HBase load

2010-04-01 Thread Oded Rosen
Hi all, I have a problem with a massive HBase loading job. It goes from raw files to HBase, through some MapReduce processing and manipulation (so loading directly to files will not be easy). After some dozen million successful writes, a few hours into the load, some of the regionservers start to die -

Re: Using SPARQL against HBase

2010-04-01 Thread Jürgen Jakobitsch
hi, this sounds very interesting to me, i'm currently fiddling around with a suitable row and column setup for triples. i'm about to implement openrdf's sail api for hbase (i just did a lucene quad store implementation which is super fast and scales to a couple of hundreds of millions of triples (
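As an illustration of the kind of row/column setup being discussed (not a layout proposed in this thread): one naive option is row key = subject, a single column family for predicates, qualifier = predicate, value = object. A sketch with a hypothetical table name "triples":

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class NaiveTripleStore {
  private static final byte[] P = Bytes.toBytes("p");   // single family holding all predicates

  // Store one triple: row = subject, qualifier = predicate, value = object.
  static void put(HTable table, String s, String p, String o) throws IOException {
    Put put = new Put(Bytes.toBytes(s));
    put.add(P, Bytes.toBytes(p), Bytes.toBytes(o));
    table.put(put);
  }

  // Look up the object for a (subject, predicate) pair.
  static String getObject(HTable table, String s, String p) throws IOException {
    Get get = new Get(Bytes.toBytes(s));
    Result r = table.get(get);
    byte[] v = r.getValue(P, Bytes.toBytes(p));
    return v == null ? null : Bytes.toString(v);
  }

  public static void main(String[] args) throws IOException {
    HTable triples = new HTable(new HBaseConfiguration(), "triples"); // hypothetical table
    put(triples, "urn:alice", "foaf:knows", "urn:bob");
    System.out.println(getObject(triples, "urn:alice", "foaf:knows"));
  }
}
```

Note that this only keeps one object per (subject, predicate); a more realistic design would encode the object into the qualifier (or maintain SPO/POS/OSP index tables) so that multiple objects per predicate can be stored and pattern queries can be answered from a scan.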

Re: Failed to create /hbase.... KeeperErrorCode = ConnectionLoss for /hbase

2010-04-01 Thread Ted Yu
Please check the following entry in hbase-env.sh: hbase-env.sh:# The directory where pid files are stored. /tmp by default. hbase-env.sh:# export HBASE_PID_DIR=/var/hadoop/pids If the pid file is stored under /tmp, it might have been cleaned up. On Thu, Apr 1, 2010 at 11:44 AM, Jean-Daniel Cryans wr

RE: Using SPARQL against HBase

2010-04-01 Thread Basmajian, Raffi
This is an interesting article from a few guys over at BBN/Raytheon. Storing the triples in flat files, they used a custom algorithm, detailed in the article, to iterate over the WHERE clause of a SPARQL query and reduce the map into the desired result. This is very similar to what I need to do; the

Re: Failed to create /hbase.... KeeperErrorCode = ConnectionLoss for /hbase

2010-04-01 Thread Jean-Daniel Cryans
If the master doesn't shut down, it means it's waiting on something... have you looked at the logs? You say you ran ./jps ... did you install that in the local directory? Also, what do you mean by "it didn't work as well"? What didn't work? The command didn't return anything, or the HMaster process wasn't l

Re: Failed to create /hbase.... KeeperErrorCode = ConnectionLoss for /hbase

2010-04-01 Thread jayavelu jaisenthilkumar
Hi Daniel, I removed the property tags from the hbase-site.xml, and the same error occurs. Also, one strange behaviour: if I run ./stop-hbase.sh, the terminal says "stopping master" and never stops. I couldn't run ./jps to check the java in th

RE: How to do "Group By" in HBase

2010-04-01 Thread Jonathan Gray
For 1/2, it seems that your row key design is ideal for those queries. You say it's inefficient because you need to scan the "whole session of data" containing hammer... but wouldn't you always have to do that unless you were doing some kind of summary/rollups? Even in a relational database yo

Re: [DISCUSSION] Release process

2010-04-01 Thread Andrew Purtell
Our org (Trend Micro) will be using an internal build based on 0.20 for at least the rest of this year. It is, really, already "1.0" from our point of view, the first ASF Hadoop release officially adopted into our production environment. I hope other users of Hadoop will speak up on this thread

Re: PerformanceEvaluation times

2010-04-01 Thread Stack
sequentialWrite 2 makes a job of two clients (two tasks only), each doing 1M rows. Your job only has 2 tasks total, right? My guess is you are paying MR overhead (though 10k seconds is excessive; something else is going on). You could try sequentialWrite 20 (20 tasks, each writing 1M rows). Als

PerformanceEvaluation times

2010-04-01 Thread Michael Dalton
Hi, I have an issue I've been running into with the PerformanceEvaluation results on my cluster. We have a cluster with 5 slaves: quad-core machines with 8GB RAM and 2x1TB disks. There are 4 map and 4 reduce slots per slave. The MapReduce-related tests seem to be running really slowly. For example, seque

How to do "Group By" in HBase

2010-04-01 Thread Sean
I have the following kind of data (a typical store sales record): {product, date, store_name} --> number. I understand that if I choose the following row key design, I will be able to quickly GROUP BY store_name: row key -- product:date:store_name, column name -- number. In other words, I
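As Jonathan's reply above suggests, grouping over this layout amounts to scanning the relevant key range and rolling up client-side. A minimal sketch of that, assuming a hypothetical table "sales" with family "d" and a qualifier "number" holding an 8-byte long, and a sortable date format in the key:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class GroupByStore {
  // Sum "number" per store_name for one product and date range, aggregated in the client.
  // Row keys are assumed to be "product:date:store_name".
  public static Map<String, Long> groupByStore(String product, String fromDate, String toDate)
      throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "sales");   // hypothetical table
    Scan scan = new Scan(Bytes.toBytes(product + ":" + fromDate),
                         Bytes.toBytes(product + ":" + toDate + "~")); // '~' sorts after ':'
    ResultScanner scanner = table.getScanner(scan);
    Map<String, Long> totals = new HashMap<String, Long>();
    try {
      for (Result r : scanner) {
        String[] parts = Bytes.toString(r.getRow()).split(":");
        String store = parts[2];
        byte[] v = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("number")); // family "d" assumed
        if (v == null) continue;
        long n = Bytes.toLong(v);
        Long prev = totals.get(store);
        totals.put(store, prev == null ? n : prev + n);
      }
    } finally {
      scanner.close();
    }
    return totals;
  }
}
```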