Also, what happens if the Accord cluster is split and then nodes are updated
in each half of the split brain?
On Sun, Sep 25, 2011 at 2:49 PM, Flavio Junqueira f...@yahoo-inc.com wrote:
ZooKeeper returns an ACK after the write reaches the disks of more than half of the machines.
Accord returns ACK after writing
On Sun, Sep 25, 2011 at 12:02 AM, OZAWA Tsuyoshi
ozawa.tsuyo...@lab.ntt.co.jp wrote:
... 1- I was wondering if you can give more detail on the setup you used to
generate the numbers you show in the graphs on your Accord page. The
ZooKeeper values are way too low, and I suspect that you're
This is not correct. You can mix and match reads, writes and version checks
in a multi.
2011/9/23 OZAWA Tsuyoshi ozawa.tsuyo...@lab.ntt.co.jp
- Limited Transaction APIs. ZK can only issue write operations (write,
del) in a transaction (multi-update).
Hbase doesn't really have a lot of flexibility in how it does a query.
A get is a get.
A scan is a scan.
A filtered scan is a scan.
There might be a few diagnostics that would tell you how many records were
rejected in a scan or which coprocessors executed or which column families
were used,
Slightly off topic, but MapR runs Hbase very handily (several times faster,
in fact) and provides comprehensive monitoring and alerting out of the box.
Contact me off-list for details if you like.
On Mon, Jul 25, 2011 at 8:09 AM, Joseph Coleman
joe.cole...@infinitecampus.com wrote:
Greetings,
that without being excessively pluggy?
Another answer that I want to underscore is MapR supports Hbase. A lot.
On Mon, Jul 25, 2011 at 11:09 AM, Stack st...@duboce.net wrote:
On Mon, Jul 25, 2011 at 8:54 AM, Ted Dunning tdunn...@maprtech.com
wrote:
Slightly off topic, but MapR runs Hbase
Let's all resolve not to do that (on-list, particularly).
On Mon, Jul 25, 2011 at 11:45 AM, Todd Lipcon t...@cloudera.com wrote:
Then we devolve into an annoying
vendor war which doesn't help anyone.
Todd,
Good to have you weigh in on this. You provide a good counterweight.
To take a new hypothetical, suppose that one of the many, many patches that
Cloudera has championed for Hadoop is critical for Hbase operation or makes
Hbase faster.
Is it reasonable to answer a question of the form Is
On Mon, Jul 25, 2011 at 12:00 PM, Stack st...@duboce.net wrote:
I felt you deserved the yellow card because the first response out the
gate was '(Slightly) off topic' and could be read as a plug for a
commercial product.
Yellow accepted.
Another answer that I want to underscore is MapR
There are other approaches as well, of course.
We have had very good results with test sites using HBase on MapR. You get
much higher performance and no SPOF.
Details beyond this are a bit off-topic so followups should be off-list.
Send me email if you have questions.
On Thu, Jul 21, 2011 at
Averages are easy to rollup as well.
Rank statistics like median, min, max and quartiles are not much harder.
Total uniques are more difficult. If you have decent distributional
information, these can be estimated reasonably well.
Mahout has code for the first two.
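The average rollup is simple enough to sketch. A minimal illustration (not Mahout's code): keep (sum, count) pairs per cell instead of the averages themselves, and averages then roll up exactly.

```python
# Sketch: averages roll up exactly if each partition reports (sum, count)
# rather than its own mean.

def merge_cells(cells):
    """Combine per-partition (sum, count) pairs into one overall mean."""
    total = sum(s for s, _ in cells)
    count = sum(c for _, c in cells)
    return total / count if count else float("nan")

# Two partitions: [1, 2, 3] (sum 6, count 3) and [10, 20] (sum 30, count 2)
rollup = merge_cells([(6.0, 3), (30.0, 2)])
```

The same trick does not work for medians or uniques, which is why those need the distributional machinery mentioned above.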
On Sun, Jul 17, 2011 at
Up to a pretty high transaction rate, you can simply use Zookeeper,
especially if you check out a block of tasks at once.
With blocks of 100-1000, you should be able to handle a million events per
second with very simple ZK data structures.
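The block-checkout arithmetic can be sketched as follows; this uses a plain dict standing in for the ZooKeeper znode (no real ZK client here), since the point is only that claiming a range of task ids costs one coordination write per block rather than one per task.

```python
# In-memory stand-in for a ZK counter znode: the dict update below plays the
# role of a versioned (compare-and-set) write against ZooKeeper.

def claim_block(counter, block_size):
    """Atomically reserve [start, start + block_size) task ids."""
    start = counter["next"]
    counter["next"] = start + block_size   # in ZK: a CAS / versioned setData
    return range(start, start + block_size)

counter = {"next": 0}
a = claim_block(counter, 1000)   # worker 1 claims tasks 0..999
b = claim_block(counter, 1000)   # worker 2 claims tasks 1000..1999

# 1,000,000 events/s with 1000-task blocks needs only ~1000 ZK writes/s:
zk_ops_per_sec = 1_000_000 // 1000
```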
On Sat, Jul 16, 2011 at 1:24 PM, Stack st...@duboce.net
To clarify what Mike means here, MapR supports HBase as well as atomic,
transactionally correct snapshots. These snapshots allow point in time
recovery of the complete state of an HBase data set.
There is no performance hit when taking the snapshot and no maintenance
impact relative to HBase
You can play tricks with the arrangement of the key.
For instance, you can put date at the end of the key. That would let you
pull data for a particular user for a particular date range. The date
should not be a time stamp, but should be a low-res version of time
(day-level resolution might be
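The date-at-the-end key layout can be sketched like this; the separator and the zero-padded day number are illustrative assumptions, not anything specified in the thread.

```python
# Hypothetical key layout: user id first, then a low-resolution (day-level)
# date, so one scan pulls a single user's date range.

def make_key(user_id: str, ts_seconds: int) -> bytes:
    day = ts_seconds // 86400          # day-level resolution, not a raw timestamp
    return b"%s:%08d" % (user_id.encode(), day)

def day_range(user_id, start_ts, stop_ts):
    """Start/stop keys for a scan over one user's date range (stop exclusive)."""
    return make_key(user_id, start_ts), make_key(user_id, stop_ts + 86400)
```

Because the zero-padded day sorts lexicographically in time order, a scan bounded by these two keys touches exactly one user's rows.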
...@web.de wrote:
- Original Message -
From: Ted Dunning tdunn...@maprtech.com
Sent: Thu Jul 14 2011 23:17:20 GMT+0200 (CET)
To:
CC:
Subject: Re: data structure
You can play tricks with the arrangement of the key.
For instance, you can put date at the end of the key. That would
At this small size (the 100 or so activities) it is unlikely to make much
sense to have indexes and such. Simply reading the entire set of activities
into memory, adding new ones and writing out the simplest format possible is
probably as fast as any other implementation.
Indexes are useful to
Also, on MapR, you get another level of group commit above the row level.
That takes the writes even further from the byte by byte level.
On Mon, Jul 11, 2011 at 9:20 AM, Andrew Purtell apurt...@apache.org wrote:
Despite having support for append in HDFS, it is still expensive to
update it
On Mon, Jul 11, 2011 at 11:22 AM, Joey Echeverria j...@cloudera.com wrote:
On Mon, Jul 11, 2011 at 12:47 PM, Ted Dunning tdunn...@maprtech.com
wrote:
Also, on MapR, you get another level of group commit above the row level.
That takes the writes even further from the byte by byte level
No, the semantics do not change.
On Mon, Jul 11, 2011 at 12:37 PM, Joey Echeverria j...@cloudera.com wrote:
:-)
No changes were required in HBase to enable this.
Do the semantics of sync change? Do you pause one or more outstanding
syncs, sync a group of data (4KB maybe) and then return
, 2011 at 11:57 AM, Ted Dunning tdunn...@maprtech.com
wrote:
On Mon, Jul 11, 2011 at 11:22 AM, Joey Echeverria j...@cloudera.com
wrote:
On Mon, Jul 11, 2011 at 12:47 PM, Ted Dunning tdunn...@maprtech.com
wrote:
Also, on MapR, you get another level of group commit above the row
level
There is also Azkaban (http://sna-projects.com/azkaban/) which provides the
scheduling and some historical statistics. Azkaban is much simpler than
oozie, but lacks some capabilities.
On Sun, Jul 3, 2011 at 11:19 PM, Ophir Cohen oph...@gmail.com wrote:
Hi,
Recently we deployed 20-nodes
Obviously this sort of test will depend massively on the level of caching.
I believe that the numbers Lohit is quoting were designed to defeat caching
and test the resulting performance.
On Fri, Jun 24, 2011 at 1:41 PM, lohit lohit.vijayar...@gmail.com wrote:
2011/6/23 Sateesh Lakkarsu
Yes.
If you have blown the cache then getting more IOPS is good.
On Fri, Jun 24, 2011 at 4:08 PM, Sateesh Lakkarsu lakka...@gmail.com wrote:
I'll look into HDFS-347, but in terms of driving more reads thru, does
having more discs help? or would RS be the bottleneck? Any thoughts on
Why are you running ZK in VM's?
If those VM's are on a smaller number of machines, then you are making your
failure modes worse, not better.
Zookeeper is very well behaved and should normally be run as a bare process.
You can migrate/upgrade the cluster at will and you can have several ZK
Are you over-committing memory?
That sounds like you may have some issues with swapping.
On Tue, Jun 21, 2011 at 8:43 AM, Andre Reiter a.rei...@web.de wrote:
the MR jobs on our Hbase table are running far too slow... RowCounter is
running about 13 minutes for 3249727 rows, that's just
Lazy increment on read causes the read to be expensive. That might be a win
if the work load has lots of data that is never read.
This could be a good idea on average because my impression is that increment
is usually used for metric sorts of data which are often only read in detail
in
He said 10^9. Easy to misread.
On Sat, Jun 11, 2011 at 6:41 PM, Stack st...@duboce.net wrote:
On Sat, Jun 11, 2011 at 1:36 AM, Andre Reiter a.rei...@web.de wrote:
so what time can be expected for processing a full scan of i.e.
1.000.000.000 rows in an hbase cluster with i.e. 3 region
Lots of people are moving towards more spindles per box to increase IOP/s
This is particular important for cases where the working set gets pushed out
of memory.
On Tue, Jun 7, 2011 at 1:32 AM, Tim Robertson timrobertson...@gmail.com wrote:
)
at
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:858)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1130)
Thanks,
Harold
--- On Tue, 5/31/11, Ted Dunning tdunn...@maprtech.com wrote:
From: Ted Dunning tdunn...@maprtech.com
Subject: Re
Yeah.. there is a bug on that.
I am spacing the number right now. And I have to run.
On Wed, Jun 1, 2011 at 11:42 PM, Harold Lim rold...@yahoo.com wrote:
I'm running HBase 0.90.2.
-Harold
--- On Thu, 6/2/11, Ted Dunning tdunn...@maprtech.com wrote:
From: Ted Dunning tdunn
Answers in-line.
On Wed, Jun 1, 2011 at 12:42 AM, Harold Lim rold...@yahoo.com wrote:
Hi Ted,
You appear to be running on about 10 disks total.
Each disk should be
capable of about 100 ops per second but they appear to be
doing about 70.
This is plausible overhead.
Each c1.xlarge
As Jason Rutherglen mentioned above, Hive can do joins. I don't know if it
can do them for HBase and it will not suit my needs, but it would be
interesting to know how is it doing them, if anyone knows.
-eran
On Tue, May 31, 2011 at 22:02, Ted Dunning tdunn...@maprtech.com wrote
This may help:
http://download.oracle.com/javase/1.5.0/docs/api/java/nio/ByteBuffer.html#array()
What is it you are actually trying to do?
On Tue, May 31, 2011 at 5:14 PM, Matthew Ward m...@imageshack.net wrote:
that is compatible with how
90.3 is implemented.
-Matt
On May 31, 2011, at 5:24 PM, Ted Dunning wrote:
This may help:
http://download.oracle.com/javase/1.5.0/docs/api/java/nio/ByteBuffer.html#array()
) or rewrite all the boilerplate code thrift generates to use byte[].
Both processes seem to be a big pain, so I was seeing if there was something
I didn't know in getting thrift to generate code that is compatible with how
90.3 is implemented.
-Matt
On May 31, 2011, at 5:24 PM, Ted Dunning
<thrift.version>0.5.0</thrift.version> <!-- newer version available -->
On Tue, May 31, 2011 at 5:54 PM, Matthew Ward m...@imageshack.net wrote:
$ thrift -version
Thrift version 0.6.0
Not sure about the Hbase Dependency.
On May 31, 2011, at 5:45 PM, Ted Dunning wrote:
Which versions
Woof.
Of course.
Harold,
You appear to be running on about 10 disks total. Each disk should be
capable of about 100 ops per second but they appear to be doing about 70.
This is plausible overhead.
Try attaching 5 or 10 small EBS partitions to each of your nodes and use
them in HDFS. That
What kind of operations?
On Mon, May 30, 2011 at 9:43 AM, Harold Lim rold...@yahoo.com wrote:
Hi All,
I have an HBase cluster on ec2 m1.large instance (10 region servers). I'm
trying to run a read-only YCSB workload. It seems that I can't get a good
throughput. It saturates to around 600+
What happens if you increase heap space to 8GB on an m1.xlarge or
m2.2xlarge?
On Mon, May 30, 2011 at 8:50 PM, Harold Lim rold...@yahoo.com wrote:
Hi Lohit,
I'm running HBase 0.90.2. 10 x ec2 m1.large instances. I set the heap size
to 4GB and handler count for hbase, and dfs to 100. I also
Bulk load is just another front door.
It is very reasonable to have an adaptive policy that throttles uploads and
switches to fairly frequent bulk loading when the load gets very high.
Whether this is an option depends on your real-time SLA's.
On Thu, May 26, 2011 at 10:55 AM, Wayne
Wayne,
It should be recognized that your experiences are a bit out of the norm
here. Many hbase installations use more recent JVM's without problems.
As such, it may be premature to point the finger at the JVM as opposed to
the workload or environmental factors. Such a premature diagnosis can
for us (plus a lot of
other
issues), and we all know what is common between the two.
On Wed, May 25, 2011 at 2:39 PM, Ted Dunning tdunn...@maprtech.com
wrote:
Wayne,
It should be recognized that your experiences are a bit out of the norm
here. Many hbase installations use more recent
giving up after
having
invested all of this time is painful.
On Wed, May 25, 2011 at 4:21 PM, Erik Onnen eon...@gmail.com wrote:
On Wed, May 25, 2011 at 11:39 AM, Ted Dunning tdunn...@maprtech.com
wrote:
It should be recognized that your experiences are a bit out of the norm
here
This may be the most important detail of all.
It is important to go with your deep skills. I would be a round peg in your
square shop and you would be a square one in my round one.
On Wed, May 25, 2011 at 5:55 PM, Wayne wav...@gmail.com wrote:
We are not a Java shop, and do not want to become
In case anybody wants estimates of medians, Mahout has some easily
extractable code to compute medians and first and third quartiles without
keeping lots of data around. As a side effect, it computes averages and
standard deviations as well.
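For a flavor of how such constant-memory estimators work, here is a toy stochastic-approximation quantile tracker in the spirit of (but not copied from) Mahout's OnlineSummarizer; unlike Mahout's class it only does quantiles, and the step size is an illustrative assumption.

```python
# O(1)-state quantile estimate: nudge the estimate down when a sample falls
# below it, up otherwise, in proportion to the target quantile q.
import random

def online_quantile(xs, q, eta=0.05):
    """Rough streaming estimate of the q-th quantile of xs."""
    est = xs[0]
    for x in xs:
        est += eta * (q - (1.0 if x < est else 0.0))
    return est

random.seed(7)
stream = [random.random() for _ in range(20000)]
approx_median = online_quantile(stream, 0.5)   # should land near 0.5
```

The estimate wanders around the true quantile with a spread set by eta, which is the price of keeping no data around.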
I don't think that such a small thing as this warrants
Are your keys arranged so that you have a problem with a hot region?
On Mon, May 16, 2011 at 11:18 PM, Weihua JIANG weihua.ji...@gmail.com wrote:
I have not applied hdfs-347, but done some other experiments.
I increased client thread to 2000 to put enough pressure on cluster. I
disabled RS
In general, Hadoop applications will perform much better with dedicated
local disks (don't use RAID for data drives, either).
On Thu, May 12, 2011 at 1:42 PM, sean barden sbar...@gmail.com wrote:
You'll get the best performance out of dedicated hardware.
Sean
On Thu, May 12, 2011 at 3:25
reduce
job that will end up with a major compaction.
Ophir
PS
The majority of my customers share the same retention policy but I still
need the ability to change it for a specific customer.
On Mon, May 9, 2011 at 6:48 PM, Ted Dunning tdunn...@maprtech.com wrote:
Can you say a bit more about your
For map-reduce, the balancing is easier because you can configure slots. It
would be nice to be
able to express cores and memory separately, but slots are pretty good.
For HDFS, the situation is much worse because the balancing is based on
percent fill. That leaves
you with much less available
Swap and gc are the usual culprits for this.
Are you running a recent enough version to have Todd's wondrous mslab
option?
On Thu, Apr 28, 2011 at 9:48 PM, Garrett Wu wugarr...@gmail.com wrote:
Some snippets from the logs are pasted below. Does anyone know what may
have caused this? Was the
Change your key to user_month.
That will put all of the records for a user together so you will only need a
single disk operation to read all of your data. Also, test the option of
putting multiple months in a single row.
On Mon, Apr 25, 2011 at 7:59 PM, Weihua JIANG
Because of your key organization you are blowing away your cache anyway so
it isn't doing you any good.
On Mon, Apr 25, 2011 at 7:59 PM, Weihua JIANG weihua.ji...@gmail.com wrote:
And we also tried to disable block cache, it seems the performance is
even a little bit better. And if we use the
for fs latency but we are not
hitting it so it's not useful.
Question is which one might be useful to measure inner ttlb, and i
don't see it there.
On Wed, Apr 20, 2011 at 1:14 PM, Ted Dunning tdunn...@maprtech.com
wrote:
FS latency shouldn't matter with your 99.9% cache hit rate
Dmitriy,
Did I hear you say that you are instantiating a new Htable for each request?
Or was that somebody else?
On Thu, Apr 21, 2011 at 11:04 PM, Stack st...@duboce.net wrote:
On Thu, Apr 21, 2011 at 10:49 PM, Dmitriy Lyubimov dlie...@gmail.com
wrote:
Anyway. For a million requests shot
Yeah... but with UDP you have to do packet reassembly yourself.
And do source quench and all kinds of things.
Been there. Done that. Don't recommend it unless it is your day job.
We built the Veoh peer to peer system on UDP. It had compelling advantages
for us as we moved a terabit of data
region).
St.Ack
Thank you very much.
-D
On Tue, Apr 19, 2011 at 6:28 PM, Ted Dunning tdunn...@maprtech.com
wrote:
For a tiny test like this, everything should be in memory and latency
should be very low.
On Tue, Apr 19, 2011 at 5:39 PM, Dmitriy Lyubimov dlie...@gmail.com
wrote
FS latency shouldn't matter with your 99.9% cache hit rate as reported.
On Wed, Apr 20, 2011 at 12:55 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Yes -- I already looked thru 'regionserver' metrics some time ago in
hbase book. And i am not sure there's a 'inner ttlb' metric.
There are fs
This is your problem. Sounds like a very deficient switch.
On Wed, Apr 20, 2011 at 11:41 AM, Kazuki Ohta kazuki.o...@gmail.com wrote:
The problem is that shuffle network transfer dominates the switch,
and important zk packets are not transferred properly at that time.
at 1:38 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
btw, Ted, your version of YCSB in github should show TTLBs, right?
On Wed, Apr 20, 2011 at 1:14 PM, Ted Dunning tdunn...@maprtech.com wrote:
FS latency shouldn't matter with your 99.9% cache hit rate as reported.
On Wed, Apr 20, 2011 at 12:55
This is kind of true.
There is only one regionserver to handle the reads, but there are
multiple copies of the data to handle fail-over.
On Tue, Apr 19, 2011 at 12:33 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
My question has to do with one of the good comments from Edward Capriolo,
How many regions? How are they distributed?
Typically it is good to fill the table somewhat and then drive some
splits and balance operations via the shell. One more split to make
the regions be local and you should be good to go. Make sure you have
enough keys in the table to support these
:)
On Tue, Apr 19, 2011 at 5:23 PM, Ted Dunning tdunn...@maprtech.com wrote:
How many regions? How are they distributed?
Typically it is good to fill the table somewhat and then drive some
splits and balance operations via the shell. One more split to make
the regions be local and you
I think that your mileage will definitely vary on this point. Your
design may work very well. Or not. I would worry just a bit if your
data points are large enough to create a really massive row (greater
than about a megabyte).
On Sun, Apr 17, 2011 at 11:48 PM, Yves Langisch y...@langisch.ch
TsDB has more columns than it appears at first glance. They store all of
the observations for a relatively long time interval in a single row.
You may have spotted that right off (I didn't).
On Sat, Apr 16, 2011 at 1:27 AM, Yves Langisch y...@langisch.ch wrote:
As I'm about to plan a similar
Michael,
This sounds like an excellent way to organize this data (buoy + time
interval id -> sequence of data points). Clearly you will also need an
auxiliary table that maps geolocation -> {buoy, time}+
The question (as you point out) is whether hbase is going to be happy to
store so much data.
This is a subtle and clever point.
On Wed, Apr 13, 2011 at 11:25 PM, Michael Dalton mwdal...@gmail.com wrote:
Avro avoids deserialization when sorting their data, but they use custom
byte array comparators for different types. All of our encodings, including
struct/record types, actually sort
Michael,
Interesting contribution to the open source community. Sounds like nice
work.
Can you say how this relates to Avro with regard to collating of binary
data?
See, for instance, here: http://avro.apache.org/docs/current/spec.html#order
On Wed, Apr 13, 2011 at 5:55 PM, Michael Dalton
Using timestamp as key will cause your scan to largely hit one region. That
may not be so good.
If you add something in front of the date, you may be able to spread your
scan over several machines.
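The "add something in front of the date" idea is often done with a stable hash-derived bucket prefix; a hedged sketch, where the bucket count and key format are illustrative assumptions rather than anything from this thread:

```python
import zlib

NUM_BUCKETS = 8

def salted_key(entity: str, ts: int) -> bytes:
    # crc32 gives a stable bucket, so the same entity always gets the same prefix
    bucket = zlib.crc32(entity.encode()) % NUM_BUCKETS
    return b"%d:%010d:%s" % (bucket, ts, entity.encode())

# A time-range scan then becomes NUM_BUCKETS small scans, one per prefix,
# instead of one scan hammering a single region:
def scan_start_keys(ts: int):
    return [b"%d:%010d" % (b, ts) for b in range(NUM_BUCKETS)]
```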
On the other hand, your aggregation might be very small. In that case, the
convenience of a time
Miguel,
One option is to use the simplest design and use the key you have. Scanning
for a particular period of time will give you all the data in that time
period which you can reduce in any way that you like.
If that becomes too inefficient, a common trick is to build a secondary file
that
Take a look at OpenTSDB.
I think you will be impressed with the speed.
Regarding the exponential explosion. Yes. That is a risk in theory. But
what happens in practice is that you only create the alternative forms of
the file where the simpler key forms are unacceptable due to volume of data.
Not original with me, I have to admit.
Some of the ideas are best described in the OpenTSDB descriptions.
On Fri, Apr 1, 2011 at 8:01 PM, M. C. Srivas mcsri...@gmail.com wrote:
Ted, this is a pretty clever idea.
On Thu, Mar 31, 2011 at 9:27 PM, Ted Dunning tdunn...@maprtech.com wrote:
Solr
, M. C. Srivas mcsri...@gmail.com wrote:
Ted, this is a pretty clever idea.
On Thu, Mar 31, 2011 at 9:27 PM, Ted Dunning tdunn...@maprtech.com
wrote:
Solr/Elastic search is a fine solution, but probably won't be quite as
fast
as a well-tuned hbase solution.
One key assumption
Solr/Elastic search is a fine solution, but probably won't be quite as fast
as a well-tuned hbase solution.
One key assumption you seem to be making is that you will store messages
only once. If you are willing to make multiple updates to tables, then you
can arrange the natural ordering of the
Do you mean 100Mb rows? That seems pretty fast.
On Thu, Mar 31, 2011 at 5:29 PM, Jean-Daniel Cryans jdcry...@apache.org wrote:
Sub-second responses for 100MBs files? You sure that's right?
Regarding proper case studies, I don't think a single one exists.
You'll find presentations decks about
Watch out when pre-splitting. Your key distribution may not be as uniform
as you might think. This particularly happens when keys are represented in
some printable form. Base 64, for instance, only populates a small fraction
of the base 256 key space.
On Tue, Mar 29, 2011 at 10:54 AM,
It should be pretty easy to down-sample the data to have no more than
1000-10,000 keys. Sort those and take every n-th key omitting the first and
last key. This last can probably best be done as a conventional script
after you have knocked down the data to small size.
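That recipe fits in a few lines of a conventional script; this sketch assumes the down-sampling to a small key set has already happened.

```python
# Compute region split points: sort the sampled keys, then take every n-th
# key, omitting the first and last so each split point is interior.

def split_points(keys, n_regions):
    ks = sorted(keys)
    step = len(ks) // n_regions
    return [ks[i * step] for i in range(1, n_regions)]

# 1000 sampled keys, 4 target regions -> 3 interior split points
pts = split_points([b"%04d" % i for i in range(1000)], 4)
```

With split points drawn from the actual sampled distribution, each region gets a roughly equal share of keys even when the keyspace itself is non-uniform.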
Note that most of your
Your mileage may vary.
If you are grouping records so that all the results are equal size, then
uniquing the keys before sampling is good. On the other hand, if you have
larger data items for repeated keys due to the grouping, then giving fewer
big keys to some regionservers is good. You can
This does sound pretty slow.
Using YCSB, I have seen insert rates of about 10,000 x 1kB records per
second with two
datanodes and one namenode using Hbase over HDFS. That isn't using thrift,
though.
On Mon, Mar 28, 2011 at 3:16 AM, Eran Kutner e...@gigya.com wrote:
I started with a basic
This sounds like you are being limited by sequentially reading records in a
single thread with multiple queries.
Can you say more about what kind of read you're doing and about the structure
of the program initiating the reads?
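If the single-threaded read loop is the bottleneck, a pool of workers usually fixes it; a generic sketch, where fetch() is a placeholder for the real per-key lookup rather than the HBase client API:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(key):
    # stand-in for a real remote Get; doubling the key lets us verify results
    return key * 2

def parallel_read(keys, workers=8):
    """Issue lookups from a worker pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(fetch, keys))
```

Threads work well here because each lookup spends most of its time waiting on the network, not the CPU.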
On Sat, Mar 26, 2011 at 10:01 AM, Hari Sreekumar
Are you putting this data from a single host? Is your sender
multi-threaded?
I note that (20 GB / 20 minutes ~ 17 MB/s) so you aren't particularly
stressing the network. You would likely be stressing a single threaded
client pretty severely.
What is your record size? It may be that you are
...@gmail.com wrote:
I have a total of 10 clients-nodes with 3-10 threads running on each node.
Record size ~1K
Viv
On Thu, Mar 24, 2011 at 8:28 PM, Ted Dunning tdunn...@maprtech.com wrote:
Are you putting this data from a single host? Is your sender
multi-threaded?
I note that (20 GB / 20 minutes
Is there a reason you are not using a recent version of 0.90?
On Mon, Mar 21, 2011 at 1:17 PM, Stuart Scott stuart.sc...@e-mis.com wrote:
We are using Hbase 0.89.20100924+28, r
No, map-reduce is not really necessary to add so few rows.
Our internal tests repeatedly load 10-100 million rows without much fuss.
And that is on clusters ranging from 3 to 11 nodes.
On Mon, Mar 21, 2011 at 1:17 PM, Stuart Scott stuart.sc...@e-mis.com wrote:
Is the only way to upload (say
This rate is dramatically slower than I would expect. In our tests, a single
insertion program
has trouble inserting more than about 24,000 records per second, but that is
because we
are inserting kilobyte values and the network interfaces are saturated at
this point. These
tests are being done
Take a look at this:
http://wiki.apache.org/hadoop/Hbase/DesignOverview
then read the bigtable paper.
On Sun, Mar 20, 2011 at 6:39 PM, edward choi mp2...@gmail.com wrote:
Hi,
I'm planning to crawl thousands of news rss feeds via MapReduce, and save
each news article into HBase directly.
Double hashing is a fine thing. To actually answer the question, though, I
would recommend MurmurHash or JOAAT (
http://en.wikipedia.org/wiki/Jenkins_hash_function)
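The Jenkins one-at-a-time (JOAAT) hash mentioned above is small enough to write out in full; this is the standard published algorithm, with masking because Python ints are unbounded. The bucketing line at the end is just an illustrative use.

```python
def joaat(data: bytes) -> int:
    """Jenkins one-at-a-time hash, 32-bit."""
    h = 0
    for b in data:
        h = (h + b) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    # finalization mixing
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h

# e.g. spread row keys over pre-split regions by bucketing the hash:
bucket = joaat(b"row-key-123") % 16
```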
On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev oct...@gmail.com wrote:
Try hash table with double hashing.
Something like
There can be some odd effects with this because the keys are not uniformly
distributed. Beware if you are using pre-split tables because the region
traffic can be pretty unbalanced if you do a naive split.
On Thu, Mar 17, 2011 at 9:20 AM, Chris Tarnas c...@email.com wrote:
I've been using
On Thu, Mar 17, 2011 at 8:21 AM, Michael Segel michael_se...@hotmail.com wrote:
Why not keep it simple?
Use a SHA-1 hash of your key. See:
http://codelog.blogial.com/2008/09/13/password-encryption-using-sha1-md5-java/
(This was just the first one I found and there are others...)
Sha-1 is
is not evenly distributed?
thanks,
-chris
On Mar 17, 2011, at 10:23 AM, Ted Dunning wrote:
There can be some odd effects with this because the keys are not uniformly
distributed. Beware if you are using pre-split tables because the region
traffic can be pretty unbalanced if you do a naive
, Ted Dunning tdunn...@maprtech.com wrote:
I have looked but can't find the postings by a student who recently posted
about their FAQ extraction program. The results were pretty good in terms
of precision and the extracted answers were very nice. The methods used
were quite simple.
Does
On Mon, Mar 14, 2011 at 8:05 PM, Andrew Look al...@shopzilla.com wrote:
This way coherent responses could be chained together in order to aggregate
more useful information, while people replying on tangents or spamming would
tend to get left out.
Interesting point.
Thoughts on how mahout
Well, since you can start iterating from any point, you can just do a
map-reduce over the larger table. In each mapper, on the first call,
initialize a scanner into the smaller table to start with the key that you
get from the larger table. Each time you get a sequential key from the
master
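The coordinated pass described above is essentially a merge join; a pure-Python sketch, with sorted lists standing in for the two tables and an index standing in for the scanner into the smaller one:

```python
# Walk the larger sorted key set once, advancing a single "scanner" (index j)
# over the smaller sorted table instead of re-seeking for every key.

def merge_join(larger, smaller):
    """Yield keys present in both sorted sequences in one coordinated pass."""
    out, j = [], 0
    for k in larger:
        while j < len(smaller) and smaller[j] < k:
            j += 1                      # advance the scanner on the smaller table
        if j < len(smaller) and smaller[j] == k:
            out.append(k)
    return out
```

Because both sides are visited in key order, the total work is linear in the two table sizes, which is what makes the mapper-side scanner trick pay off.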
With no information whatsoever about size of the data, I would guess a cost
of about $4000 / node with annual hosting and power requirements about
$2000/year.
This is probably no more accurate than one order of magnitude. It has a
decent chance of being on the close order of magnitude. In
You mean like write a map-reduce program that joins the key sets and outputs
what you want?
On Thu, Mar 10, 2011 at 8:08 PM, Vishal Kapoor
vishal.kapoor...@gmail.com wrote:
Friends,
how do I best achieve intersection of sets of row ids
suppose I have two tables with similar row ids
how can I
Speaking of which, swapping can be triggered by cron jobs.
It can also be due to what JD says where a process goes idle and slowly gets
swapped out due to I/O pressure on page bufs.
On Tue, Mar 8, 2011 at 9:24 AM, Jean-Daniel Cryans jdcry...@apache.org wrote:
Maybe the process is getting
per second (with some primitive caching
thrown in).
Aditya
On Fri, Mar 4, 2011 at 11:54 AM, Ted Dunning tdunn...@maprtech.com wrote:
What kinds of speeds are you seeing?
On Thu, Mar 3, 2011 at 10:19 PM, Aditya Sharma
adityadsha...@gmail.com wrote:
Hi All,
I am working on benchmarking
Even that is bad. The problem is that the cost of incorrectly stopping
regionservers is much higher than the cost of not stopping regionservers.
Stopping a regionserver scrambles data locality until all of the regions are
compacted. At the margins, this could decrease performance enough to kill a
If the regionservers come up one at a time, then I think that the region
assignment can get hosed even with 0.90.
If they are close to simultaneous, then things are better.
On Fri, Mar 4, 2011 at 6:49 PM, Jean-Daniel Cryans jdcry...@apache.org wrote:
On Fri, Mar 4, 2011 at 6:41 PM, Ted Dunning
I think that the proposal on the table is to actually simplify things a bit
by making the shutdown of the master
not cause the shutdown of the regions. Less coupling is simpler.
On Thu, Mar 3, 2011 at 2:47 PM, M. C. Srivas mcsri...@gmail.com wrote:
To tell you the truth, I really like the
commented, I thought we had removed the 'master exit
= cluster death' but I'm not sure.
-ryan
On Thu, Mar 3, 2011 at 4:14 PM, Ted Dunning tdunn...@maprtech.com wrote:
I think that the proposal on the table is to actually simplify things a
bit
by making the shutdown of the master
not cause