Having an issue with table design, specifically how to delete old/obsolete data.
My row keys are not time-sorted: the id comes first, followed by the timestamp.
The main objective is running big scans over a specific id from time x to time y.
However, this data builds up at a respectable rate and I …
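A minimal sketch of the scan being described, assuming the row key is the id
bytes followed by an 8-byte timestamp (Bytes.toBytes(long)) and using the
0.20-era Java client; the table name, id, and time bounds are hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeRangeScan {
      public static void main(String[] args) throws Exception {
        long x = 1270000000000L, y = 1270100000000L;  // [x, y) in millis
        // Rows sort as <id><timestamp>, so a start/stop row pair bounds
        // one id's time window without touching other ids.
        byte[] start = Bytes.add(Bytes.toBytes("id42"), Bytes.toBytes(x));
        byte[] stop = Bytes.add(Bytes.toBytes("id42"), Bytes.toBytes(y));
        HTable table = new HTable(new HBaseConfiguration(), "events");
        ResultScanner scanner = table.getScanner(new Scan(start, stop));
        try {
          for (Result r : scanner) {
            // process one row of the time window here
          }
        } finally {
          scanner.close();
        }
      }
    }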
> From: jg...@facebook.com
> To: hbase-user@hadoop.apache.org
> Date: Thu, 1 Apr 2010 09:34:17 -0700
> Subject: RE: How to do "Group By" in HBase
>
> For 1/2, it seems that your row key design is ideal for those queries. You
> say it's inefficient because you need to scan the "whole session of data" …
My general thought about prefix compression is that using LZO can help blunt
the worst size issues. LZO easily gets 4x compression on our production
dataset, so there hasn't been much effort at integrating prefix compression.
I would be willing to help someone who wanted to implement prefix
compression in con…
Matt is correct in saying that block-cached data is stored as-is, not in
compressed form. It might be possible to keep the blocks in RAM compressed,
but it would be a challenge to figure out how to decompress as we scan, or
at least to do so efficiently. Prefix compression should be fa…
Again - thanks for the feedback. I agree it's easy enough to make a 1-byte
ColumnFamily name, which makes that overhead negligible. And it just occurred
to me that prefix compression could solve the repeated-key problem. So, like
you said in your first response, prefix compression may be the only logical
ne…
Matt,
Make your families a single character. You get almost all of the space
savings of not duplicating the family name, without any HBase changes.
As for row keys, since they will be duplicated within each block, even
standard LZO compression (not prefix compression) should do a decent job.
You could see 2-3X compression…
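A sketch of both suggestions using the 0.20-era admin API; the table name is
hypothetical, and the LZO codec must already be installed on the cluster:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CreateCompressedTable {
      public static void main(String[] args) throws Exception {
        // Single-character family name keeps per-KeyValue overhead small.
        HColumnDescriptor family = new HColumnDescriptor("f");
        // Block-level LZO also compresses the row keys repeated in each block.
        family.setCompressionType(Compression.Algorithm.LZO);
        HTableDescriptor table = new HTableDescriptor("mytable");
        table.addFamily(family);
        new HBaseAdmin(new HBaseConfiguration()).createTable(table);
      }
    }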
Hi again,
I'm definitely interested.
You probably heard of the Heart project, but there's hardly anything going
on there, so I think this is well worth the effort.
For your discussion days I'd recommend taking a look at the OpenRDF Sail API
at http://www.openrdf.org/doc/sesame2/system/
The point is that ther…
Jonathan - thanks for the detailed answer. I'm sure implementing this stuff
is a nightmare when trying to minimize object instantiations. But since you
mentioned it had been discussed before, here's a concrete example to throw
some support behind non-duplication and prefix compression in future
re…
I am very interested in participating.
Victor
On 4/1/10 5:45 PM, "ext Amandeep Khurana" wrote:
Andrew and I just had a chat about exploring how we can leverage HBase for a
scalable RDF store and we'll be looking at it in more detail over the next
few days. Are any of you interested in helping out? …
Andrew and I just had a chat about exploring how we can leverage HBase for a
scalable RDF store and we'll be looking at it in more detail over the next
few days. Are any of you interested in helping out? We are going to be
looking at what all is required to build a triple store + query engine on
top of HBase …
First:
"ulimit: 1024"
That's fatal. You need to raise the file descriptor limit to something like
32K. See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6.
From there, let's see.
- Andy
> From: Oded Rosen
> Subject: DFSClient errors during massive HBase load
> To: hbase-user@hadoop.apache.org
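For reference, raising the limit usually means editing /etc/security/limits.conf
for the account that runs HBase (assuming here that it is "hadoop"; adjust to
your setup), then logging in again:

    hadoop  soft  nofile  32768
    hadoop  hard  nofile  32768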
Hey,
Looks like DFS errors. The "bad firstAck" is a sign of a datanode problem.
Perhaps your ulimit is too low - an xciever count of 2048 but only 1024
sockets? Sounds suspicious to me.
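The two limits usually need to move together: the OS file descriptor limit and
the datanode xciever cap, e.g. in hdfs-site.xml (the value below is
illustrative; the property name really is spelled this way):

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>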
On Thu, Apr 1, 2010 at 1:19 PM, Oded Rosen wrote:
> Hi all,
>
> I have a problem with a massive HBase loading job…
Hi all,
I have a problem with a massive HBase loading job.
It goes from raw files into HBase, through some MapReduce processing and
manipulation (so loading directly to files will not be easy).
After some dozen million successful writes - a few hours of load - some of
the regionservers start to die -…
Hi,
This sounds very interesting to me; I'm currently fiddling around with a
suitable row and column setup for triples. I'm about to implement OpenRDF's
Sail API for HBase (I just did a Lucene quad store implementation which is
super fast and scales to a couple of hundred million triples (…
Please check the following entry in hbase-env.sh:
hbase-env.sh:# The directory where pid files are stored. /tmp by default.
hbase-env.sh:# export HBASE_PID_DIR=/var/hadoop/pids
If the pid file is stored under /tmp, it might have been cleaned up.
On Thu, Apr 1, 2010 at 11:44 AM, Jean-Daniel Cryans wrote:
This is an interesting article from a few guys over at BBN/Raytheon. Storing
triples in flat files, they used a custom algorithm, detailed in the article,
to iterate over the WHERE clauses of a SPARQL query and reduce the map into
the desired result.
This is very similar to what I need to do; the…
If the master doesn't shut down, it means it's waiting on something... have
you looked at the logs?
You say you ran ./jps ... did you install that in the local directory?
Also, what do you mean by "it didn't work as well"? What didn't work? The
command didn't return anything, or the HMaster process wasn't l…
Hi Daniel,
I removed the property tags from the hbase-site.xml.
The same error occurs.
Also, one strange behaviour: if I run ./stop-hbase.sh, the terminal says
"stopping master"
and it never stops.
I wasn't able to run ./jps to check the Java processes in th…
For 1/2, it seems that your row key design is ideal for those queries. You say
it's inefficient because you need to scan the "whole session of data"
containing hammer... but wouldn't you always have to do that unless you were
doing some kind of summary/rollups? Even in a relational database you…
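A sketch of that access pattern with the 0.20-era client, assuming the
product:date:store_name key layout from the original question, scanning one
product over a date range and rolling up per store on the client side; the
table name "sales" and family name "f" are hypothetical:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GroupByStore {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "sales");
        // Rows for one product sort together, so a date range is one
        // contiguous slice: [hammer:20100301:, hammer:20100401:).
        Scan scan = new Scan(Bytes.toBytes("hammer:20100301:"),
                             Bytes.toBytes("hammer:20100401:"));
        Map<String, Long> totals = new HashMap<String, Long>();
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            // Row key is product:date:store_name; group on the store part.
            String store = Bytes.toString(r.getRow()).split(":")[2];
            long n = Bytes.toLong(r.getValue(Bytes.toBytes("f"),
                                             Bytes.toBytes("number")));
            Long prev = totals.get(store);
            totals.put(store, (prev == null ? 0L : prev) + n);
          }
        } finally {
          scanner.close();
        }
        System.out.println(totals);
      }
    }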
Our org (Trend Micro) will be using an internal build based on 0.20 for at
least the rest of this year. It is, really, already "1.0" from our point of
view, the first ASF Hadoop release officially adopted into our production
environment. I hope other users of Hadoop will speak up on this thread…
sequentialWrite 2 makes a job of two clients (only two tasks), each writing
1M rows. Your job only has 2 tasks total, right? My guess is you are paying
MR overhead (though 10k seconds is excessive - something else is going on).
You could try sequentialWrite 20 (20 tasks each writing 1M rows). Also…
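For reference, the wider run would be invoked something like this (assuming
the stock PerformanceEvaluation class bundled with the 0.20 release):

    $ bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 20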
Hi, I have an issue I've been running into with the PerformanceEvaluation
results on my cluster. We have a cluster with 5 slaves: quad-core machines
with 8GB RAM and 2x1TB disks. There are 4 map and 4 reduce slots per slave.
The MapReduce-related tests seem to be running really slowly. For example,
sequentialWrite…
I have the following kind of data (a typical store sales record): {product,
date, store_name} --> number
I understand that if I choose the following row key design, I will be able
to quickly GROUP BY store_name:
row key -- product:date:store_name
column name -- number
In other words, I…
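A sketch of writing one such record with that key layout, using the 0.20 Java
client; the table name "sales" and family name "f" are hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteSaleRecord {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "sales");
        // Composite key product:date:store_name keeps one product's rows
        // together, sorted by date and then store.
        Put put = new Put(Bytes.toBytes("hammer:20100401:store_7"));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("number"), Bytes.toBytes(3L));
        table.put(put);
      }
    }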